Conclusion - Performance Analysis and Optimization of String Operations on a Java Processor

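To locate the performance bottlenecks of string processing in Java programs, this thesis proposes a method profiler and a bytecode profiler. Both profilers collect method invocation counts, bytecode execution counts, and the associated execution times without changing the program's execution behavior or execution time.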

To improve heap utilization, we designed and implemented an on-chip Java heap management unit that shortens the time of each heap allocation, paired with a heap cache that reduces memory access time.

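To let JAIP control hardware acceleration circuits, we also designed the Hardware Native Interface. Through native method declarations and modifications to the cross reference table, among other changes, specific Java methods can be implemented in logic circuits. For string acceleration, we identified two basic and frequently used operations, arraycopy and indexOf, and implemented an acceleration circuit for each. Tests on benchmarks and real-world examples show that the hardware accelerators are highly effective.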
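For reference, the two accelerated operations can be written as plain software loops. The sketch below is only a software model of their semantics, not the accelerator design; the class and method names are hypothetical, and the indexOf variant shown searches for a single character only.

```java
// Software reference for the semantics of the two accelerated operations.
// Names are hypothetical; this is not the hardware implementation.
final class StringOpsReference {

    // Element-wise copy, equivalent to System.arraycopy for
    // non-overlapping ranges of two char arrays.
    static void copyChars(char[] src, int srcPos, char[] dst, int dstPos, int length) {
        for (int i = 0; i < length; i++) {
            dst[dstPos + i] = src[srcPos + i];
        }
    }

    // Linear search for one character, starting at index `from`.
    // Returns the index of the first match, or -1 if there is none.
    static int indexOf(char[] text, char target, int from) {
        for (int i = Math.max(from, 0); i < text.length; i++) {
            if (text[i] == target) {
                return i;
            }
        }
        return -1;
    }
}
```

Both loops touch one element per iteration, which is the kind of regular, data-parallel work that a dedicated circuit can plausibly perform faster than a bytecode sequence.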

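With the method profiler and bytecode profiler designed in this thesis, future work on JAIP can search for various performance bottlenecks. Besides showing how each method and bytecode is used, the four counters report BEE execution time, dynamic resolution time, interrupt time, and heap access time. Taking the StringAtom program as an example, the method profile of the top-level method shows that dynamic resolution accounts for about 26.1% of total time, and the bytecode profile shows that field operations, invocation, and return bytecodes account for 79.8% of total time. The next optimization step can therefore target the symbol resolution flow. For example, a getfield executed inside a loop repeats the same symbol resolution on every iteration; if the field value could instead be cached in a local variable, much of this repeated and lengthy work would be saved. Likewise, the information resolved during an invocation could be held in a dedicated storage element so that, until a return or another invocation executes, each repeat of the same invocation can use a custom fast version of the invocation bytecode.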
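A source-level analogue of the field-caching idea is sketched below. It assumes the optimization is applied by hand in Java source, whereas the text leaves open where it would be applied; the class and its names are hypothetical. With a typical Java compiler, the first loop emits a getfield for `this.data` on each iteration and the second reads the field once and then works on a local variable.

```java
// Illustrative only: class, field, and method names are hypothetical.
final class CharCounter {
    private final char[] data;   // instance field, read with getfield

    CharCounter(String s) {
        this.data = s.toCharArray();
    }

    // The loop condition and the loop body both read this.data,
    // so the field access is repeated on every iteration.
    int countWithFieldAccess(char target) {
        int n = 0;
        for (int i = 0; i < this.data.length; i++) {
            if (this.data[i] == target) {
                n++;
            }
        }
        return n;
    }

    // The field is read once into a local variable; the loop then
    // uses only local variable loads.
    int countWithLocalCopy(char target) {
        final char[] local = this.data;
        final int len = local.length;
        int n = 0;
        for (int i = 0; i < len; i++) {
            if (local[i] == target) {
                n++;
            }
        }
        return n;
    }
}
```

Both methods return the same count; only the number of field accesses inside the loop differs.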

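The Hardware Native Interface proposed in this thesis lets a Java program pass parameter values to a hardware device through a native method call, and lets the device's computation results be passed back onto the Java stack. For performance reasons, the string accelerator in this thesis is placed inside the JAIP IP rather than attached to the bus and accessed directly through I/O ports; with this interface in place, however, implementing memory mapped I/O is no longer difficult. For example, bus signals such as the target memory address, the type of access request, and the access mode can be passed through the Hardware Native Interface to JAIP's external memory access logic, and that logic can be wrapped in a Java native method. Java programmers can then use memory mapped I/O through ordinary method invocation. JAIP can thus access any hardware device directly from Java code using memory mapped I/O, without co-processing by a RISC core. The same approach can also control JAIP's internal logic; for example, a native method could control the thread scheduling policy.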
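What such a wrapper might look like from the Java programmer's side is sketched below. Every name here (`MemoryPort`, `SimulatedPort`, `Led`, the register address) is hypothetical and does not come from JAIP. On real hardware the `read` and `write` methods would be declared `native` and bound to the external memory access logic; a software stand-in replaces them here so that the example runs anywhere.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical view of a memory mapped I/O port. On real hardware these two
// methods would be native methods bound through the Hardware Native Interface.
interface MemoryPort {
    int read(int address);
    void write(int address, int value);
}

// Software stand-in for the hardware: registers are kept in a map,
// and an unwritten register reads as 0.
final class SimulatedPort implements MemoryPort {
    private final Map<Integer, Integer> registers = new HashMap<>();

    @Override
    public int read(int address) {
        return registers.getOrDefault(address, 0);
    }

    @Override
    public void write(int address, int value) {
        registers.put(address, value);
    }
}

// A device driver written purely in Java on top of the port.
final class Led {
    static final int LED_REG = 0x4000_0000;   // hypothetical register address
    private final MemoryPort port;

    Led(MemoryPort port) {
        this.port = port;
    }

    void set(int mask) {
        port.write(LED_REG, mask);
    }

    int get() {
        return port.read(LED_REG);
    }
}
```

A caller would construct `new Led(new SimulatedPort())` and call `set` and `get` as ordinary methods. Keeping the port behind an interface means the driver code does not change when the simulated port is swapped for a native one.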

Appendix 1: Design of the Method Profiler

The Method Profiler consists mainly of two tables: the Profile Stack and the Profile Table.

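The Profile Stack is a stack whose entries are each 160 bits wide, holding four 32-bit timer values and one 32-bit method ID. The Profile Table likewise has 160-bit entries: each entry stores the accumulated time of the four timers for one method, with the remaining 32 bits storing the number of times that method has been called. Each Profile Table entry is thus the profile of one method. Whenever a method is invoked, the ID of the invoked method and the values of the four timers are pushed onto the top of the stack. When a method …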

Figure 28. Profile Table in the Method Profiler

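The Method Profiler architecture is shown in Figure 27. Internally it consists mainly of the Profile Stack and Profile Table, both implemented in BRAM, and an FSM that controls the circuit. The input signals are the four counter values, invoke_flag, return_flag, and invoke_method_ID. When invoke_flag is asserted, the four counter values and the method_ID are pushed directly onto the Profile Stack. When return_flag is asserted, the counter values are latched into the return_time_stmp flip-flops as the timestamp of the return, and MP_FSM is started. MP_FSM has four states: normal, load_time_stmp, load_time_accm, and update_prof. The load_time_stmp state reads the entry at the top of the Profile Stack. Besides the timestamp taken when the method was invoked, the entry contains the method ID that was pushed when invoke_flag was asserted, so the method_ID popped when return_flag is asserted identifies the method now executing the return bytecode. The load_time_accm state uses the popped method ID to read the corresponding method profile from the Profile Table, that is, the four accumulated counter values and the call count of that method. The final state, update_prof, adds the time of this execution and one call to the accumulated values read in the previous state. The time of this execution is simply the timestamp held in the return_time_stmp flip-flops minus the timestamp popped from the Profile Stack.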
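The invoke and return handling described above can be modeled in software as follows. This is a behavioral sketch, not the circuit: it uses `long` values and Java collections where the hardware uses 32-bit fields in BRAM, it collapses the load_time_stmp, load_time_accm, and update_prof states into one method, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Behavioral model of the Method Profiler (names are hypothetical).
final class MethodProfilerModel {
    static final int COUNTERS = 4;   // HW, Intrpt, DSRU, Heap

    // Profile Stack entry: four counter snapshots followed by the method ID.
    private final Deque<long[]> profileStack = new ArrayDeque<>();
    // Profile Table entry: four accumulated counters followed by the call count.
    private final Map<Integer, long[]> profileTable = new HashMap<>();

    // invoke_flag asserted: push the counters and the method ID.
    void onInvoke(int methodId, long[] counters) {
        long[] entry = new long[COUNTERS + 1];
        System.arraycopy(counters, 0, entry, 0, COUNTERS);
        entry[COUNTERS] = methodId;
        profileStack.push(entry);
    }

    // return_flag asserted: latch the counters, then run the FSM steps.
    void onReturn(long[] counters) {
        long[] returnTimeStamp = counters.clone();        // return_time_stmp
        long[] entry = profileStack.pop();                // load_time_stmp
        int methodId = (int) entry[COUNTERS];
        long[] profile = profileTable.computeIfAbsent(    // load_time_accm
                methodId, k -> new long[COUNTERS + 1]);
        for (int i = 0; i < COUNTERS; i++) {              // update_prof
            profile[i] += returnTimeStamp[i] - entry[i];
        }
        profile[COUNTERS]++;                              // one more call
    }

    // Returns a copy of the profile, or null if the method never returned.
    long[] profileOf(int methodId) {
        long[] p = profileTable.get(methodId);
        return p == null ? null : p.clone();
    }
}
```

Because each entry stores the counters at invocation and the difference is taken at return, the time accumulated for a method in this model includes the time spent in the methods it calls.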

Figure 29. Method Profiler

Appendix 2: Design of the Bytecode Profiler

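To measure the number of cycles each bytecode takes to execute after the decode stage, the bytecode profiler needs two pieces of information besides the hardware counters: the bytecode about to execute and an issued signal. The bytecode profiler computes the execution time of each bytecode from the interval between successive assertions of issued; each assertion marks both the start of the current bytecode and the end of the previous one. Because the decode stage holds only the translated j-code, we added extra signal lines and flip-flops to the pipeline to carry the bytecode to the decode stage, so that the bytecode corresponding to each j-code is available. The fetch stage of the BEE classifies each bytecode as simple or complex. A simple bytecode is translated into a single j-code; a complex bytecode is translated into a j-code sequence, which the fetch stage looks up by bytecode and sends to the decode stage. The issued signal is asserted by the fetch stage: immediately for a simple bytecode, and only when the first j-code of the sequence is sent for a complex bytecode. The fetch stage passes issued to the decode stage, which forwards it to the bytecode profiler.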

Figure 30. Profile Table in the Bytecode Profiler

Figure 31. Bytecode Profiler

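The bytecode profiler design is shown in Figure 31. As in the method profiler, four counters record HW, Intrpt, DSRU, and Heap time. In the figure, bytecode_1, issued_1, bytecode_2, and issued_2 come from the decode stage of the modified BEE. issued_1 indicates that a bytecode has just started executing, and that bytecode arrives on bytecode_1 at the same time. If two simple bytecodes are double-issued, the issued information and bytecode of the second arrive on issued_2 and bytecode_2.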

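Because the full execution time of a bytecode is known only after issued has been asserted twice, each assertion first records the current timestamp; when issued is asserted again, the interval is accumulated into the bytecode profile. Four main things happen when issued is asserted: 1. The profile corresponding to the bytecode is read from the profile table and stored in profile_buf_1 or profile_buf_2. 2. The current counter values are placed in time_stmp_1 or time_stmp_2. 3. The bytecode now executing is placed in bytecode_buf_1 or bytecode_buf_2, and the bytecode_buf_valid flag is set to 1.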
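The interval accounting between successive assertions of issued can be modeled as below. The sketch covers a single issue slot only, so the second set of buffers used for double-issue is left out; it uses `long` counters and a 256-entry table indexed by opcode, both of which are assumptions, and the class and method names are hypothetical.

```java
// Behavioral model of the bytecode profiler for one issue slot
// (names are hypothetical; double-issue is not modeled).
final class BytecodeProfilerModel {
    static final int COUNTERS = 4;   // HW, Intrpt, DSRU, Heap

    // One profile per opcode value (assumed table size: 256 entries).
    private final long[][] profileTable = new long[256][COUNTERS];
    private final long[] timeStamp = new long[COUNTERS];   // time_stmp
    private int bytecodeBuf;                                // bytecode_buf
    private boolean bytecodeBufValid = false;               // bytecode_buf_valid

    // Called each time issued is asserted for a new bytecode.
    void onIssued(int opcode, long[] counters) {
        if (bytecodeBufValid) {
            // This assertion ends the previous bytecode: accumulate its interval.
            long[] profile = profileTable[bytecodeBuf];
            for (int i = 0; i < COUNTERS; i++) {
                profile[i] += counters[i] - timeStamp[i];
            }
        }
        // It also starts the current bytecode: record timestamp and opcode.
        System.arraycopy(counters, 0, timeStamp, 0, COUNTERS);
        bytecodeBuf = opcode & 0xFF;
        bytecodeBufValid = true;
    }

    // Returns a copy of the accumulated counters for one opcode.
    long[] profileOf(int opcode) {
        return profileTable[opcode & 0xFF].clone();
    }
}
```

In this model the last bytecode issued is never accounted for until another one is issued, which mirrors the statement that an execution time is known only after issued has been asserted twice.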


