Background - Program Transformation Techniques for Multi-Core Architectures

Embedded systems are a highly dynamic field because of the fast pace of innovation in hardware, operating systems, and application development technologies. With tremendous commercial potential at stake, and with the relatively small investment a company needs to enter the market, it is not surprising that the embedded systems market is filled with a huge and diverse array of hardware devices and associated system software platforms. As a result, the design space for embedded systems has also grown dramatically. One has to consider target-system-specific transformations, such as memory optimization requiring target-architecture-dependent loop transformations, optimized word length selection, and process restructuring for fine-grain load distribution.

As stated in [Ernst], “the problem is worse here than with parallel compilers because of architecture specialization.”

Research interests in optimization techniques for embedded systems have grown in recent years, in an attempt to understand the various optimization strategies applicable to the design of embedded systems in terms of time-space performance and/or power consumption.

For example, conferences and workshops such as LCTES and ODES are devoted to this goal. Compiler and optimization research for embedded systems shares its foundation with general compiler techniques, but has specific challenges to address:

• Energy awareness. Many approaches address the energy consumption issues that are crucial for embedded systems (e.g. [Lambrechts], [VanderAa03], [VanderAa05]).

• Memory hierarchy. Some compiler techniques exploit the characteristics of memory hierarchies, which differ from one embedded system to another ([Chen], [Grewal], [Ozturk], [Sanghai07]).

• Transformation and code generation. Some efforts propose code generation and transformation techniques for embedded systems. Others focus on specific problem domains, such as FFT algorithms or other multimedia applications, and develop code generation methods tailored to them. Examples include [Ali], [Alur], [Burgaard], [Cheng], [Franke], [Hong].

• Multi-core architecture. Work in this category concerns parallelization for multi-core embedded systems or more general multi-processor architectures ([Dupre], [Perkins], [Sanghai05]), and is closely related to traditional parallelizing compiler research.

• Profiling-based optimization. Work in this category proposes methods that guide the compiler with profiling information, possibly obtained from the actual execution of automatically generated programs ([Cavazos], [Peri], [Zhao05], [Suresh]).

• Framework and methods. There are also research efforts proposing general frameworks and methods that are applicable to embedded systems ([Aarts], [Barat], [Fulton], [Pan], [Vachharajani], [Zhao03]).

The references presented above are just samples from the vast literature in compiler research. Furthermore, the categories above are not mutually exclusive, since a research effort often considers multiple aspects simultaneously. In what follows we focus on research and development relevant to compiler techniques, or more specifically program analysis and transformation systems for multi-core embedded systems, without probing further into areas such as energy consumption or more hardware-oriented research.

For efforts related to program transformation, [Franke] investigates source-level transformation for embedded systems, incorporating a probabilistic feedback-driven search for suitable transformation sequences. Specifically, it combines a simple random search for space exploration with a focused search based on a machine learning approach, in order to reduce the extremely large search space. [Lee] investigates dual instruction set processors, which are increasingly popular in embedded systems. Typically, a program compiled with the reduced instruction set (16 bits/instruction) has a smaller code size but runs slower, whereas the same program compiled with the full instruction set (32 bits/instruction) runs faster but has a larger code size. Thus [Lee] first compiles a program with the reduced instruction set, and then analyses the program and selects a set of basic blocks to be recompiled with the full instruction set, choosing the set that gives the maximum performance gain while keeping the code size under a given upper bound.
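The block selection step described for [Lee] resembles a 0/1 knapsack problem: each basic block offers a performance gain at the cost of extra code size, and the total extra size is bounded. The following is only a minimal sketch under that reading, not the algorithm of [Lee] itself. The function name `select_blocks`, the tuple layout, and all numbers are hypothetical.

```python
from typing import List, Tuple

def select_blocks(blocks: List[Tuple[str, int, int]],
                  size_budget: int) -> Tuple[int, List[str]]:
    """Pick blocks to recompile with the full instruction set.

    blocks: (name, cycle_gain, extra_bytes) per basic block (hypothetical data).
    size_budget: maximum allowed growth in code size, in bytes.
    Returns (total_gain, names_of_chosen_blocks).
    """
    # best[b] = (gain, chosen) achievable with at most b extra bytes
    best = [(0, [])] * (size_budget + 1)
    for name, gain, extra in blocks:
        # Walk the budget downward so each block is chosen at most once.
        for b in range(size_budget, extra - 1, -1):
            candidate = best[b - extra][0] + gain
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - extra][1] + [name])
    return best[size_budget]

# Hypothetical profile: gains in cycles, growth in bytes.
profile = [("B1", 60, 10), ("B2", 100, 20), ("B3", 120, 30)]
print(select_blocks(profile, 50))  # (220, ['B2', 'B3'])
```

A dynamic program is used here because a greedy choice by gain per byte would pick B1 first and end with a lower total gain under the same budget.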

When the development of the hardware is also part of an embedded system project, the matter becomes more involved. [Ernst] provides an overview of research in hardware/software co-design of embedded systems. [Bennett] proposes a framework that combines automated code transformation with instruction set extension (ISE) generators to explore the potential benefits of such a combination. Although the design space is even larger in the hardware/software co-design context, the discussion above still applies. In addition, several points are worth noting:

• An integrated and coherent co-design system should capture the complete design specification, including hardware and software models, and support design space exploration with optimization based on this specification.

• Synchronization and integration of hardware and software design become an issue.

• The boundary between hardware and software deserves attention. Although it resembles the boundary between programs written in high-level languages and those written in assembly language (to boost the performance of a critical part), the hardware/software boundary is “permanent” and thus requires much more rigorous analysis and careful decisions.

General-purpose source code transformation frameworks are also relevant to our study. For example, [Cordy] presents the TXL system, which supports all aspects of parsing, pattern matching, transformation rules, application strategies, and unparsing within a specially designed language with no dependence on other tools or technologies. [Bravenboer] describes another framework, Stratego/XT, which is essentially a collection of reusable components and tools for the development of transformation systems. Its transformation components are implemented in the Stratego language, which provides rewrite rules for expressing basic transformations, programmable rewriting strategies for controlling the application of rules, concrete syntax for expressing the patterns of rules in the syntax of the object language, and dynamic rewrite rules for expressing context-sensitive transformations.
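The separation between rewrite rules and the strategy that applies them can be illustrated with a small sketch. It is written in Python rather than in TXL or Stratego syntax, and the term encoding (nested tuples), the rule names, and the `bottomup` helper are all assumptions made for illustration only.

```python
# Terms are ints, variable names (str), or tuples ("op", left, right).

def rule_add_zero(t):
    # add(x, 0) -> x
    if isinstance(t, tuple) and t[0] == "add" and t[2] == 0:
        return t[1]
    return None

def rule_mul_one(t):
    # mul(x, 1) -> x
    if isinstance(t, tuple) and t[0] == "mul" and t[2] == 1:
        return t[1]
    return None

def rule_fold_add(t):
    # add(c1, c2) -> c1 + c2 for integer constants
    if (isinstance(t, tuple) and t[0] == "add"
            and isinstance(t[1], int) and isinstance(t[2], int)):
        return t[1] + t[2]
    return None

RULES = [rule_add_zero, rule_mul_one, rule_fold_add]

def try_rules(rules, t):
    """Apply the first matching rule; return the term unchanged if none matches."""
    for rule in rules:
        out = rule(t)
        if out is not None:
            return out
    return t

def bottomup(rules, t):
    """Strategy: rewrite the children first, then the node until no rule applies."""
    if isinstance(t, tuple):
        t = (t[0],) + tuple(bottomup(rules, child) for child in t[1:])
    while True:
        new = try_rules(rules, t)
        if new == t:
            return t
        t = new

print(bottomup(RULES, ("mul", ("add", "x", 0), ("add", 1, 0))))  # x
```

The rules say what may be rewritten, while `bottomup` decides where and in which order. Swapping in a different traversal leaves the rules untouched, which is the point of keeping strategies programmable.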

Another important category of compiler research for embedded systems concerns parallelization. For multi-core architectures this usually amounts to partitioning a program into parts, each of which may run on a different core. [Suresh] provides both compiler-based and simulation-based loop analyzers that profile an application, and a loop analysis toolset to support hardware/software partitioning. These tools are used to identify the most frequently executed core fragments of many benchmark programs. However, these core fragments are moved into hardware manually. [Kim] presents a source code analysis technique that, although not targeting embedded systems per se, is still relevant to parallelization and program transformation. The technique extracts so-called pre-execution code from an ordinary program, so that the pre-execution code can be placed in a separate helper thread that runs “in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation.”

When more general compiler techniques are concerned, the space for exploration includes many dimensions. A particularly important dimension is the kind of optimizations that can be performed. This dimension is vast because of the ever-increasing complexity of hardware architectures. There is also a large body of research on applying existing, general optimizations to embedded systems. An even more challenging dimension concerns the potential interference among optimizations. It is well known that the order of optimizations can affect the quality of the final result; some optimizations may enable or disable later optimizations. [Zhao03] investigates the impact of optimizations on embedded systems.

[Franke], mentioned above, also discusses its program transformation framework for embedded systems from this optimization space exploration perspective. The iterative compilation approach (e.g. [Aarts]) addresses this problem by applying optimizations in different ways and observing the performance characteristics of the actual generated code, rather than relying on heuristics or abstract performance models. [Triantafyllis] proposes a more elaborate framework for exploring the space of optimizations. In this framework, a compiler optimizes each code segment with a variety of optimization configurations and examines the code after optimization to choose the best version produced. Because this finer-grained iterative compilation approach results in a larger search space, the framework also proposes methods to reduce it. [Pan] builds a feedback-directed optimization orchestration algorithm that searches for the combination of optimization techniques achieving the best program performance. The algorithm successively identifies and removes harmful optimizations, measured through a series of program executions, with the goal of reducing the number of compilations while maintaining the quality of the generated code.
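The idea of successively removing harmful optimizations can be sketched as follows. This is a simplified illustration in the spirit of the description above, not the algorithm of [Pan]. The flag names and the `toy_runtime` cost model are invented; in practice `measure` would compile and run the program.

```python
from typing import Callable, FrozenSet, Iterable

def eliminate_harmful(flags: Iterable[str],
                      measure: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Start with every optimization on; repeatedly switch off the one whose
    removal improves the measured run time the most, until none helps."""
    enabled = frozenset(flags)
    best_time = measure(enabled)
    while enabled:
        # Measure the effect of disabling each flag individually.
        trials = {f: measure(enabled - {f}) for f in sorted(enabled)}
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best_time:
            break  # no single removal helps any more
        enabled = enabled - {candidate}
        best_time = trials[candidate]
    return enabled

def toy_runtime(enabled: FrozenSet[str]) -> float:
    """Invented cost model with one interaction: 'sched' only pays off
    when 'unroll' is disabled."""
    t = 100.0
    if "inline" in enabled:
        t -= 20
    if "licm" in enabled:
        t -= 10
    if "unroll" in enabled:
        t += 5
    if "sched" in enabled:
        t += 3 if "unroll" in enabled else -8
    return t

result = eliminate_harmful(["inline", "licm", "unroll", "sched"], toy_runtime)
print(sorted(result), toy_runtime(result))  # ['inline', 'licm', 'sched'] 62.0
```

Each round costs one measurement per remaining flag, which is far fewer than the 2^n configurations of an exhaustive search but may miss interactions that only appear when several flags change together.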

Because our transformation framework is geared towards Java programs, it is also worth considering related work on embedded Java. The Java Platform, Micro Edition (J2ME) is a set of technologies and specifications developed for small devices such as pagers, mobile phones, and set-top boxes. J2ME uses smaller-footprint subsets of Java SE components, such as smaller virtual machines and leaner APIs, and defines a number of APIs that are specifically targeted at consumer and embedded devices. It is intended to enable users, service providers, and device manufacturers to take advantage of a rich portfolio of application content that can be delivered to the user’s device on demand, over wired or wireless connections.

Although JVMs can be implemented using straightforward interpreters, the most popular approach to improving JVM performance is to replace or augment the interpreter with a just-in-time (JIT) compiler, which transforms the bytecode into machine code that can be executed directly by the host machine. Depending on the level of optimization applied, the translated machine code may even approach the speed of equivalent programs written in C.

However, the JIT compilation phase can be quite complicated, depending on the optimizations performed, and can require substantial memory and processing time. For embedded systems in particular, the impact on user experience can be significant, especially at application start-up, when a device may appear unresponsive for a long period of time. Compiling all bytecode into native code therefore incurs too much overhead, especially since a large portion of the bytecode is typically executed very rarely or not at all.

Accordingly, some researchers try to design lightweight and efficient JIT compilers. For example, [Tabatabai] proposes a JIT compiler for the Intel IA32 architecture that generates native IA32 instructions directly from the bytecode in a single pass. Other than a control-flow graph used for register allocation, the JIT does not generate an explicit intermediate representation. Rather, it uses the bytecode itself to represent expressions and maintains additional structures that are managed on the fly. This is in contrast to other Java JIT implementations, which transform bytecode into an explicit intermediate representation. As another example, [Shudo] presents cost-effective code generation and optimization methods based on connecting templates. The code generator connects pre-fabricated templates of native code, each corresponding to an internal instruction. In addition, stack caching [Ertl] is implemented in the compiler, so that the generated code makes use of multiple registers across templates.
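Template-based code generation can be pictured as concatenating a fixed code fragment for each bytecode instruction. The sketch below only illustrates that general idea and does not reproduce the generator of [Shudo]. The template table, the register names, and the pseudo-assembly are invented and are not real IA32 code.

```python
from typing import List, Optional, Tuple

# One pre-fabricated fragment of pseudo-assembly per stack-machine instruction.
TEMPLATES = {
    "iconst": ["push {arg}"],
    "iload":  ["push [locals+{arg}]"],
    "istore": ["pop r0", "mov [locals+{arg}], r0"],
    "iadd":   ["pop r1", "pop r0", "add r0, r1", "push r0"],
}

def generate(bytecode: List[Tuple[str, Optional[int]]]) -> List[str]:
    """Single pass: look up each instruction's template and append it."""
    out = []
    for op, arg in bytecode:
        for line in TEMPLATES[op]:
            out.append(line.format(arg=arg))
    return out

# locals[1] = locals[0] + 2
for line in generate([("iload", 0), ("iconst", 2), ("iadd", None), ("istore", 1)]):
    print(line)
```

The output contains a `push r0` immediately followed by a `pop r0` where the `iadd` and `istore` templates meet. Redundant stack traffic of this kind is what keeping the top of the stack in registers, as in stack caching, is meant to avoid.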

Therefore, most JIT compilers are selective about the bytecode they compile, often on a per-method basis. That is, they record how often each loaded method is invoked and compile only those methods that are (or are expected to be) frequently executed. Furthermore, because applying more optimizations requires more memory and processing time, many JIT compilers offer multiple optimization levels, each with different memory and processing time requirements, so that different optimizations can be applied to different methods according to their relative frequencies.

Another way of addressing compilation overhead is through so-called dynamic adaptive compilers (DAC), sometimes also referred to as mixed-mode interpreters (MMI). In a DAC, bytecode is initially executed by interpretation while the runtime profiles the code and determines the key code sections to be compiled. Some variations avoid interpretation altogether and convert the bytecode into native code immediately using an inexpensive translation.

Once the key code sections, mostly methods, are identified as hot, they are compiled using more advanced optimization options. Furthermore, more than one optimization level may be available, so that only extremely hot sections receive extensive analysis and optimization. The popular Jikes RVM (Research Virtual Machine) employs such an adaptive approach.
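The interplay of profiling and multiple optimization levels can be sketched with simple invocation counters. This is a toy model only: the class name, the tier names, and the thresholds are hypothetical and are not taken from Jikes RVM or any other JVM.

```python
class AdaptiveRuntime:
    """Toy model of a dynamic adaptive compiler: methods start out
    interpreted and are promoted as their invocation count grows."""

    # (minimum invocation count, execution tier); thresholds are invented.
    TIERS = [(0, "interpreted"), (3, "baseline"), (10, "optimized")]

    def __init__(self):
        self.calls = {}   # method name -> invocation count
        self.tier = {}    # method name -> current execution tier

    def invoke(self, method: str) -> str:
        """Record one invocation and return the tier used to run it."""
        count = self.calls.get(method, 0) + 1
        self.calls[method] = count
        level = "interpreted"
        for threshold, name in self.TIERS:
            if count >= threshold:
                level = name
        self.tier[method] = level
        return level

rt = AdaptiveRuntime()
levels = [rt.invoke("hotLoop") for _ in range(12)]
print(levels[0], levels[2], levels[9])  # interpreted baseline optimized
print(rt.invoke("rarelyUsed"))          # interpreted
```

A real system would also weigh the estimated compilation cost against the expected benefit, and would typically compile in a background thread rather than at the call site.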

Another category of JVM optimization that is more relevant to our study is so-called ahead-of-time (AOT) compilation. As the name suggests, if the Java programs and/or bytecode are known prior to execution, we can employ traditional compilation steps and translate them into native code in advance. Extensive program analysis and optimization can therefore be applied without the time and memory constraints of run-time compilation. We believe AOT compilation is a very important technique for embedded systems because there are often core applications that are bundled with a given embedded system. Note that if the embedded Java system is required to download and execute bytecode dynamically, a JIT compiler is still essential.

Optimized ahead-of-time compilation attempts to produce code having size and speed comparable to code written in C/C++, while remaining compatible with the Java world, allowing for the mixing and matching of code according to individual system requirements.

Some AOT compilers, such as GCJ, translate Java programs into native code (through the common GCC back end), while others translate Java into C and rely on a C compiler to perform sophisticated optimizations. Note that AOT compilation does not conflict with JIT compilation. In fact, many AOT compilation frameworks still require a JVM to process and execute bytecode when Java applications are allowed to contain both natively compiled parts and bytecode parts.

AOT compilers are usually not used for dynamically downloaded classes. It is nevertheless possible to invoke an existing JIT compiler on bytecode and cache the compiled machine code “ahead of time,” thereby substantially reducing the time for class loading and optimization. This simpler approach to AOT compilation comes almost for free because it reuses the JIT compiler that already exists. CVM provides such a facility, which allows engineers to select additional classes for AOT compilation and include them in the final binary footprint. [Hong] extends this idea and proposes a “client-side” compilation performed at run time for downloaded classes. The idea works because in certain embedded applications, such as set-top boxes, applications may be executed many times after they are downloaded, so the time saved on class loading and optimization can be significant.
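Caching JIT output so that later runs skip compilation can be sketched as a lookup keyed by a hash of the class file contents. This is an illustrative model only and does not describe the mechanism of CVM or [Hong]. The class `CompiledCodeCache`, the stand-in `fake_jit` function, and the in-memory dictionary (which stands in for persistent storage on the device) are all assumptions.

```python
import hashlib
from typing import Callable, Dict

class CompiledCodeCache:
    """Compile a class the first time it is seen; reuse the result afterwards."""

    def __init__(self, compile_fn: Callable[[bytes], bytes]):
        self._compile = compile_fn
        self._store: Dict[str, bytes] = {}  # stand-in for on-device storage
        self.compilations = 0

    def load(self, class_bytes: bytes) -> bytes:
        # Key on the content so a changed class file is recompiled.
        key = hashlib.sha256(class_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = self._compile(class_bytes)
            self.compilations += 1
        return self._store[key]

def fake_jit(class_bytes: bytes) -> bytes:
    """Placeholder for an expensive JIT compilation step."""
    return b"native:" + class_bytes[::-1]

cache = CompiledCodeCache(fake_jit)
downloaded = b"\xca\xfe\xba\xbe example class"
for _ in range(5):          # the application is launched five times
    cache.load(downloaded)
print(cache.compilations)   # 1
```

The benefit grows with the number of launches per download, which matches the set-top box scenario described above.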
