In recent years, electronic communication devices such as mobile phones, information appliances (IAs), and personal digital assistants (PDAs) have attracted much attention, and their numbers continue to grow rapidly year by year. Java is a programming language that offers good portability, security, reliability, and compatibility. These properties have made Java widely used for developing applications for electronic communication devices.
SRAM and DRAM are the two most common memories adopted in embedded systems.
SRAM is typically faster (by a factor of 10 to 100) but more expensive (by a factor of 20 or more) than DRAM [1], and the difference in speed is still increasing. SRAM speed rises by about 50% a year on average, a rate similar to that of processor speed [2], versus only 7% a year for DRAM [3].
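As a rough illustration of how quickly this gap widens, the quoted growth rates can be compounded. The calculation below assumes that both rates stay constant from year to year, which is a simplification; the symbol n (number of years) and the five-year horizon are chosen only for illustration.

```latex
\[
\text{relative gap growth per year} = \frac{1 + 0.50}{1 + 0.07} \approx 1.40,
\qquad
\text{after } n \text{ years: } \left(\frac{1.50}{1.07}\right)^{n},
\qquad
\left(\frac{1.50}{1.07}\right)^{5} \approx 5.4
\]
```

Under these assumptions, the SRAM-to-DRAM speed ratio grows by about 40% a year and by more than a factor of five within five years.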
Caches are among the most widespread uses of SRAM in processors, while main memory is usually DRAM. Due to the widening gap between SRAM and DRAM speeds, the stall cycles resulting from cache misses have become a considerable part of the program execution time. For example, in our experiments on a processor with an 8KB direct-mapped instruction cache and an 8KB direct-mapped data cache, 30.85% of the execution time is spent on instruction cache miss stall cycles. Reducing the number of instruction cache misses is therefore an effective way to improve execution performance.
Quite a few embedded processors contain not only caches but also a kind of SRAM called scratch-pad memory (SPM) or local memory. In this work, we propose a method that uses the scratch-pad memory to reduce the instruction cache misses arising during program execution. At runtime, every JIT-compiled method is first allocated to the SPM and may later be dynamically reallocated to the main memory or to the SPM according to variations in the program behavior. The experimental results demonstrate that our design can significantly reduce instruction cache misses, thereby decreasing the program execution time and improving the execution performance.
1.1 Java Technology
Java technology originated at Sun Microsystems in 1991, was first released publicly in 1995, and has become increasingly prevalent in numerous application fields. To meet the demands of different application fields, Sun Microsystems has divided Java technology into the following three editions:
Java Platform, Enterprise Edition
Java EE targets transactional, scalable, and database-centered applications on servers and enterprise computers.
Java Platform, Standard Edition
Java SE provides plenty of APIs for creating applications running on servers and personal computers.
Java Platform, Micro Edition
Java ME provides an environment for applications running on small devices with limited memory, display, and power capacity, such as mobile phones, personal digital assistants (PDAs), TV set-top boxes, and printers.
Figure 1-1 shows the components of Java technology and the respective targeted products of different Java platform editions.
In this research, our design chiefly targets the small devices served by Java Platform, Micro Edition, whose cache capacities are generally small. Java ME contains many technologies and specifications for constructing a platform that can meet the specific requirements of a small device. Java ME is composed of three elements [4]:
Configuration
A configuration provides the most basic set of libraries and virtual machine capabilities.
To fit a wide range of devices with diverse hardware capabilities, Java ME is divided into two configurations, Connected Device Configuration (CDC) and Connected Limited Device Configuration (CLDC). CDC targets larger devices with more capacity and with a network connection, such as smart phones, high-end PDAs, and TV set-top boxes, whereas CLDC fits resource-constrained devices, like mobile phones and low-end PDAs.
Profile
A profile is a set of APIs that support a narrower range of devices.
Optional Package
An optional package is a set of technology-specific APIs.
Figure 1-1 Components of Java Technology and Targeted Products [4]
1.2 Execution of Java Programs
Java programs are first compiled by a Java compiler, before execution, into an intermediate representation referred to as bytecode. When the compilation is finished, the bytecode is saved in one or more class files, which are fed into a Java virtual machine (JVM) for execution. Throughout program execution, the class loader in the JVM loads class files into the memory heap on demand, and the interpreter in the JVM interprets the loaded bytecode.
Although an interpreter is easy to implement, its slow performance makes it unsuitable for environments where performance is an essential consideration. To overcome this problem, integrating a Just-In-Time (JIT) compiler into the JVM has been proposed.
In a JVM that comprises both an interpreter and a JIT compiler, Java programs may be executed in a mixed mode, which combines the aforementioned interpretation mode with the JIT-compilation mode, as illustrated in Figure 1-2. As in pure interpretation, Java programs must first be compiled into bytecode by a Java compiler. When a program begins running, the JVM executes it by directly interpreting its bytecode. At the same time, the numbers of invocations and backward branches of each method are counted to calculate the popularity value of the method. Once the popularity value of a method reaches the popularity threshold, meaning that the method is executed frequently enough, the JIT compiler is triggered to translate the bytecode of the method into native machine code, and free space is allocated from the code buffer in the main memory to store the compiled code. When the method is executed afterwards, it need not be compiled again: the compiled code is already stored in the code buffer and can be fetched immediately for execution.
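The counting and triggering logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular JVM: the class name MixedModeSketch, the threshold value of 3, the unweighted sum of invocations and backward branches, and the use of a map as a stand-in for the code buffer are all hypothetical. The sketch also assumes that the call that reaches the threshold still runs in the interpreter.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of popularity-driven mixed-mode dispatch (all names are hypothetical). */
final class MixedModeSketch {
    enum Mode { INTERPRETED, COMPILED }

    // Assumed threshold; a real JVM chooses its own value.
    static final int POPULARITY_THRESHOLD = 3;

    // Popularity value per method: invocations plus backward branches (assumed unweighted).
    private final Map<String, Integer> popularity = new HashMap<>();
    // Stand-in for the code buffer: method name -> placeholder for native code.
    private final Map<String, String> codeBuffer = new HashMap<>();

    /** Executes one call of a method and reports the mode in which that call ran. */
    Mode execute(String method, int backwardBranches) {
        if (codeBuffer.containsKey(method)) {
            return Mode.COMPILED; // compiled code is reused without recompilation
        }
        int value = popularity.merge(method, 1 + backwardBranches, Integer::sum);
        if (value >= POPULARITY_THRESHOLD) {
            // Trigger the (simulated) JIT compiler and store the result in the code buffer.
            codeBuffer.put(method, "native:" + method);
        }
        return Mode.INTERPRETED; // this call was still interpreted
    }

    boolean isCompiled(String method) {
        return codeBuffer.containsKey(method);
    }

    public static void main(String[] args) {
        MixedModeSketch vm = new MixedModeSketch();
        for (int call = 1; call <= 4; call++) {
            System.out.println("call " + call + ": " + vm.execute("hot", 0));
        }
    }
}
```

In this sketch a method with many backward branches (for example, one containing a long loop) reaches the threshold after fewer invocations than a method without loops, which mirrors the role of the backward-branch count in the popularity value.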
JIT compilation is performed during execution of a Java program and therefore inevitably introduces additional runtime overhead. However, because executing native machine code is far faster than interpreting bytecode, mixed-mode execution still speeds up Java program execution considerably.
Figure 1-2 Mixed-Mode Execution of Java Programs
1.3 Code Buffer for Storing JIT-Compiled Code
As mentioned above, when the popularity value of a method reaches the popularity threshold, the bytecode of the method is compiled into native machine code by the JIT compiler, and a free space needs to be allocated from the code buffer in the main memory to store the compiled code for future utilization. From the perspective of a processor, emitting JIT-compiled code is the same as writing data into the main memory. In the case of a data cache with the write-allocate policy, JIT-compiled code is first written into the data cache and then written into the code buffer in the main memory. The process of writing JIT-compiled code into the code buffer is described in Figure 1-3.
Any JIT-compiled code must first be loaded into the instruction cache before it can be executed by a processor. If the compiled code to be executed is already in the instruction cache, the processor can execute it directly from the instruction cache. Otherwise, an instruction cache miss occurs and causes the processor pipeline to stall for a number of cycles, referred to as the cache miss penalty, until the compiled code is loaded from the code buffer in the main memory into the instruction cache. The cache miss penalty is considerable, ranging from several dozen to several hundred cycles. After the compiled code is loaded into the instruction cache, the processor can proceed to execute it from the instruction cache.
The process of reading compiled code from the code buffer is illustrated in Figure 1-4.
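To make the cost of such misses concrete, the following sketch models a direct-mapped instruction cache and accumulates the stall cycles caused by misses. It is only an illustration under assumed parameters: the 8KB capacity matches the configuration used in our experiments, but the 32-byte line size, the 100-cycle miss penalty, and the class name DirectMappedICache are hypothetical values chosen for the example.

```java
/** Minimal direct-mapped instruction cache model (line size and penalty are assumptions). */
final class DirectMappedICache {
    private final int lineSize;     // bytes per cache line
    private final int numLines;     // number of lines in the cache
    private final long[] tags;      // tag stored in each line
    private final boolean[] valid;  // whether each line holds valid contents
    private final int missPenalty;  // stall cycles charged per miss
    private long misses = 0;
    private long stallCycles = 0;

    DirectMappedICache(int capacityBytes, int lineSize, int missPenalty) {
        this.lineSize = lineSize;
        this.numLines = capacityBytes / lineSize;
        this.tags = new long[numLines];
        this.valid = new boolean[numLines];
        this.missPenalty = missPenalty;
    }

    /** Fetches the instruction at the given address; returns true on a hit. */
    boolean fetch(long address) {
        long block = address / lineSize;
        int index = (int) (block % numLines); // each block maps to exactly one line
        long tag = block / numLines;
        if (valid[index] && tags[index] == tag) {
            return true;
        }
        // Miss: the pipeline stalls while the line is loaded from main memory.
        valid[index] = true;
        tags[index] = tag;
        misses++;
        stallCycles += missPenalty;
        return false;
    }

    long misses() { return misses; }

    long stallCycles() { return stallCycles; }

    public static void main(String[] args) {
        DirectMappedICache cache = new DirectMappedICache(8 * 1024, 32, 100);
        long a = 0x1000;
        long b = a + 8 * 1024; // maps to the same line as a, with a different tag
        cache.fetch(a); // cold miss
        cache.fetch(a); // hit
        cache.fetch(b); // conflict miss, evicts a
        cache.fetch(a); // miss again
        System.out.println(cache.misses() + " misses, " + cache.stallCycles() + " stall cycles");
    }
}
```

The two addresses in the usage example are exactly one cache capacity apart, so they compete for the same line. This is the kind of conflict that two JIT-compiled methods placed unluckily in the code buffer can cause in a direct-mapped cache.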
Figure 1-3 Writing JIT-Compiled Code into Code Buffer
Figure 1-4 Reading JIT-Compiled Code from Code Buffer
1.4 Observation on Instruction Cache Misses for Java
In our experiments on Java applications, we found that instruction cache miss stall cycles constitute a considerable part of the program execution time (30.85% for an environment with an 8KB direct-mapped instruction cache and an 8KB direct-mapped data cache, without L2 caches), and that over half (50.57%) of the instruction cache miss stall cycles are caused by JIT-compiled code, as shown in Figure 1-5. To decrease instruction cache misses, we may use a scratch-pad memory to hold the JIT-compiled code that frequently incurs instruction cache misses.
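Combining the two percentages gives a rough estimate of how much execution time is at stake. The derivation below is a back-of-the-envelope bound, not a measured result: it assumes that both percentages refer to the same configuration, that every instruction cache miss caused by JIT-compiled code could be eliminated, and that nothing else in the execution changes. The symbols f_JIT and S_max are introduced here only for illustration.

```latex
\[
f_{\mathrm{JIT}} = 0.3085 \times 0.5057 \approx 0.156
\]
\[
S_{\max} = \frac{1}{1 - f_{\mathrm{JIT}}} \approx \frac{1}{0.844} \approx 1.18
\]
```

In other words, about 15.6% of the execution time is spent on instruction cache miss stall cycles caused by JIT-compiled code, so under these assumptions the speedup obtainable from removing those stalls is bounded by roughly 1.18.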
Figure 1-5 Breakdown of Execution Time for the Environment Containing an 8KB Direct-Mapped Instruction Cache and an 8KB Direct-Mapped Data Cache
1.5 Scratch-Pad Memory (SPM)
A scratch-pad memory is a memory array of SRAM cells with decoding circuitry and column circuitry, as depicted in Figure 1-6. A scratch-pad memory is commonly an on-chip memory. A study [5] compared the area cost and the energy consumption of scratch-pad memory and cache, and its results indicate that a scratch-pad memory has 34% smaller area and 40% lower energy consumption than a two-way set-associative cache of the same capacity. Additionally, according to our conversion using CACTI 4.1 [6], the area of a scratch-pad memory is 31% smaller than that of a direct-mapped cache of the same capacity. However, unlike a cache, which is invisible to software, a scratch-pad memory relies on software to control the allocation of instructions or data and is hence visible to software. To strike a balance between scratch-pad memory and cache, quite a few embedded processors, such as ARM10E, PXA270, ColdFire MCF5, IXP, and PowerPC 405, have a scratch-pad memory as well as one or more caches. A brief comparison between scratch-pad memory and cache is given in Table 1-1.
Figure 1-6 Scratch-Pad Memory Organization
Table 1-1 Comparison between Cache and Scratch-Pad Memory
| | Cache | Scratch-Pad Memory |
| --- | --- | --- |
| Loading Time | At Runtime | Before or At Runtime |
| Controlled by | Hardware | Software |
| Allocation Visibility | Invisible to Software | Visible to Software |
| Area Cost Ratio [5] | 1 (Direct-Mapped) | 0.69 |
1.6 Research Motivation
Owing to the large gap between cache and main memory speeds, when a cache miss occurs, it takes a large number of cycles (from several dozen to several hundred) to load instructions or data into the cache, so cache miss stall cycles weigh heavily on program execution performance. We observed that instruction cache miss stall cycles occupy a considerable part of the program execution time, and that over half of the instruction cache miss stall cycles are caused by JIT-compiled code. In other words, a large portion of the execution time is spent on instruction cache miss stall cycles caused by JIT-compiled code. If we place in the SPM the JIT-compiled code that frequently incurs instruction cache misses, many instruction cache misses will be eliminated, and thus the execution time may be decreased significantly.
In addition, an SPM has lower area cost and lower energy consumption than a cache of equal capacity, but it introduces the overhead of software maintenance for SPM allocation. Hence, quite a few embedded processors, such as ARM10E, PXA270, ColdFire MCF5, IXP, and PowerPC 405, contain an SPM along with one or more caches to strike a balance between SPM and cache. For these processors, it is essential to develop an efficient SPM allocation scheme that makes good use of the SPM.
So far, no SPM allocation scheme has succeeded in always adjusting SPM allocation dynamically and exactly according to variations in the program behavior throughout program execution. As a result, if the behavior of a program (e.g., an interactive application) varies during execution, the SPM allocation may no longer fit the program behavior, leading to diminishing benefits from using the SPM. However, we can move JIT-compiled methods to the SPM whenever they cause numerous instruction cache misses during program execution. Consequently, we may devise a dynamic SPM allocation scheme capable of adjusting the selection of JIT-compiled methods in the SPM as the dynamic program behavior varies.
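The general idea of moving methods into the SPM when they cause many instruction cache misses can be sketched as follows. This is a hypothetical illustration, not the allocation scheme developed in this thesis: the class name SpmAllocatorSketch, the miss threshold, and in particular the eviction of the least recently used resident method are assumptions made only to keep the example small.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of miss-driven SPM allocation with LRU eviction (policy details are assumptions). */
final class SpmAllocatorSketch {
    private final int capacity;      // SPM size in bytes
    private final int missThreshold; // misses required before a method is moved to the SPM
    private int used = 0;            // bytes currently occupied in the SPM
    // Access-ordered map: iteration starts at the least recently used resident method.
    private final LinkedHashMap<String, Integer> residents = new LinkedHashMap<>(16, 0.75f, true);
    // Instruction cache misses observed for methods that live in main memory.
    private final Map<String, Integer> missCount = new HashMap<>();

    SpmAllocatorSketch(int capacity, int missThreshold) {
        this.capacity = capacity;
        this.missThreshold = missThreshold;
    }

    /** Records that a method resident in the SPM was executed (refreshes its recency). */
    void touch(String method) {
        residents.get(method);
    }

    /** Records a miss caused by a method in main memory; returns true if it moved to the SPM. */
    boolean recordMiss(String method, int sizeBytes) {
        if (residents.containsKey(method)) {
            return false; // code in the SPM is not fetched through the instruction cache
        }
        int count = missCount.merge(method, 1, Integer::sum);
        if (count < missThreshold || sizeBytes > capacity) {
            return false;
        }
        // Move least recently used residents back to main memory until the method fits.
        Iterator<Map.Entry<String, Integer>> it = residents.entrySet().iterator();
        while (used + sizeBytes > capacity && it.hasNext()) {
            used -= it.next().getValue();
            it.remove();
        }
        residents.put(method, sizeBytes);
        used += sizeBytes;
        missCount.remove(method);
        return true;
    }

    boolean isInSpm(String method) {
        return residents.containsKey(method);
    }

    int usedBytes() {
        return used;
    }

    public static void main(String[] args) {
        SpmAllocatorSketch spm = new SpmAllocatorSketch(100, 2);
        for (String m : new String[] {"a", "a", "b", "b"}) {
            spm.recordMiss(m, 40);
        }
        spm.touch("a");          // a becomes more recently used than b
        spm.recordMiss("c", 40);
        spm.recordMiss("c", 40); // c reaches the threshold and displaces b
        System.out.println("a=" + spm.isInSpm("a") + " b=" + spm.isInSpm("b") + " c=" + spm.isInSpm("c"));
    }
}
```

Because the decision is driven by misses observed at runtime, the set of methods held in the SPM follows the program behavior as it changes. A practical scheme must additionally keep the bookkeeping and the copying cost low enough that they do not cancel the benefit.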
1.7 Research Objective
This research aims to reduce the instruction cache misses that arise during execution of Java applications by identifying the JIT-compiled methods that incur more instruction cache misses at runtime and allocating them to the SPM dynamically. Furthermore, the set of JIT-compiled methods allocated to the SPM must be adjustable according to variations in the program behavior throughout program execution. Moreover, because the dynamic allocation method works at runtime, its overhead must be low enough that the gain is not offset by the runtime overhead. In short, the goal of this research is to design a low-overhead dynamic SPM allocation approach for JIT-compiled code that reduces instruction cache misses, thereby decreasing the execution time of Java applications and improving the execution performance.
1.8 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 introduces the background knowledge and the work related to our research. Chapter 3 presents the memory hierarchy and the components of the execution environment in the design of dynamic SPM allocation for JIT-compiled code. Chapter 4 describes our experimental process and gives the experimental results and some analyses. Chapter 5 presents the conclusion and the future work.