
Chapter 1 Introduction

1.3 Synopsis

The remainder of this thesis is organized as follows. Chapter 2 discusses related work. In Chapter 3, we redefine the categories of source-level transformations and detail the transformations presented. In Chapter 4, an experimental framework is presented and a number of experiments are designed to verify the energy efficiency of the transformations, followed by the results and analyses in Chapter 5. Finally, the conclusion and future work are given in the last chapter.

Chapter 2 Related Work

As energy consumption becomes increasingly important in the design of embedded systems, several source-level transformations [13]-[26] have been proposed to achieve low-energy software design. These transformations have been evaluated with different metrics, such as performance and energy consumption. However, prior work does not consider the impact of the instruction set architecture (ISA) on energy consumption. In our research, we adopt the StrongARM as our target processor and carry out a number of experiments to check whether the transformations can be used to save energy. The impacts of the ISA and the APCS on energy consumption are discussed as well. We also find that the compiler's optimization levels and options affect energy consumption. To evaluate the energy consumption of software programs, we adopt the EMSIM energy simulator [31] as part of our framework. Finally, the SUIF compiler system [34], a compiler infrastructure, is discussed. In the system, passes can be developed to perform transformations automatically; we use some passes released by other research groups in our experiments.

2.1 Transformations in Source Code Level

According to different requirements, such as better performance, smaller code size, or lower energy consumption, several source-level transformations have been presented in [13]-[26]. Russell et al. [2] concluded that minimizing software execution time (i.e., improving performance) results in minimized energy consumption. Although this is not always true, we find that improved performance usually accompanies energy consumption savings.

Optimization is the heart of advanced compiler design. A number of optimizations which may improve the performance of the object code produced by a compiler were presented in [13], [14]. Muchnick [13] divided compiler optimizations into two main areas: intraprocedural and interprocedural optimizations. Intraprocedural optimizations include redundancy elimination, loop optimizations, procedure optimizations, register allocation, code scheduling, and control-flow and low-level optimizations. Optimization for the memory hierarchy was also presented. Morgan [14] pointed out that an optimizing compiler attempts to use all of the resources of the processor and memory as effectively as possible in executing the application program. Hence, a number of optimizations for transforming the program to obtain better performance were presented, including dominator optimization, interprocedural optimization, dependence optimization, global optimization, instruction scheduling, register allocation, and instruction rescheduling. The compiler optimizations in [13], [14] include many transformations which may reduce energy consumption at the source code level.

Brandolese et al. [15], [16] presented a methodology and a set of models supporting energy-driven source-to-source transformations. They grouped source-to-source transformations into four main categories according to the code structures they operate on: loops, data structures, procedures, and control structures and operators. A number of transformations in the different categories were presented.

Chung et al. [17] proposed a new transformation which reduces computational effort by using value profiling and specializing a program for highly expected situations. The goal of this technique is to improve energy consumption and performance by reducing computational effort.

In [18]-[21], a number of transformations expected to reduce energy consumption were presented for the purposes of system-level power optimization, compiler optimizations for low-power systems, reducing instruction cache energy consumption, and iterative compilation for energy reduction, respectively.

In addition, some transformations presented for other purposes are still useful for saving energy, such as improving data locality with loop transformations [22], augmenting loop tiling with data alignment for improved cache performance [23], optimization of computer programs in C [24], writing efficient C for ARM [25], and transforming and parallelizing ANSI C programs using pattern recognition [26].

2.2 ISA on Energy Consumption

The ISA of the target machine may affect energy consumption, because the source code of a program must be compiled and assembled into object code according to the instruction set of the target machine. The number of instructions generated and the types of instructions executed both affect energy consumption.

In this thesis, we focus on the impact of the ARM ISA on energy consumption after transformations, so the ARM ISA is discussed below. The ARM instruction set can be divided into six broad classes of instructions [27]:

- Branch instructions
- Data processing instructions
- Status register transfer instructions
- Load and store instructions
- Coprocessor instructions
- Exception-generating instructions

In our research, we find that the data processing and the load and store instructions affect energy consumption, so we detail these two groups below. Based on the following discussions, we propose new transformations tied to the ARM ISA in the ISA-specific transformations section.

2.2.1 Data Processing Instructions of ARM

ARM has 16 data processing instructions as shown in Table 2-1.

Table 2-1 ARM Data processing instructions (Seal [27])

Mnemonic  Opcode  Action
AND       0000    Rd := Rn AND shifter_operand
EOR       0001    Rd := Rn EOR shifter_operand
SUB       0010    Rd := Rn - shifter_operand
RSB       0011    Rd := shifter_operand - Rn
ADD       0100    Rd := Rn + shifter_operand
ADC       0101    Rd := Rn + shifter_operand + Carry Flag
SBC       0110    Rd := Rn - shifter_operand - NOT(Carry Flag)
RSC       0111    Rd := shifter_operand - Rn - NOT(Carry Flag)
TST       1000    Update flags after Rn AND shifter_operand
TEQ       1001    Update flags after Rn EOR shifter_operand
CMP       1010    Update flags after Rn - shifter_operand
CMN       1011    Update flags after Rn + shifter_operand
ORR       1100    Rd := Rn OR shifter_operand
MOV       1101    Rd := shifter_operand (no first operand)
BIC       1110    Rd := Rn AND NOT(shifter_operand)
MVN       1111    Rd := NOT shifter_operand (no first operand)

There are 11 addressing modes used to calculate the shifter_operand in an ARM data processing instruction. The impact of the immediate addressing mode on energy consumption is discussed below. As shown in Figure 2-1, this data processing operand provides a constant operand to a data processing instruction. It is encoded in the instruction as an 8-bit immed_8 and a 4-bit rotate_imm, so that the immediate value equals the result of rotating immed_8 (zero-extended to 32 bits first) right by twice the value of rotate_imm. Hence, an immediate value must be one of the following:

- less than or equal to 255;
- a multiple of 4 between 256 and 1023;
- a multiple of 16 between 1024 and 4095;
- a multiple of 64 between 4096 and 16383;
- and so on.

If you want to assign a value outside these ranges, more than one instruction is needed to complete the operation.

Figure 2-1 Data processing operands - Immediate (Seal [27])

2.2.2 Load and Store Instructions of ARM

The load and store register instructions among the load and store instructions are discussed in this section. They use a base register and an offset specified by the instruction. In offset addressing, the memory address is formed by adding or subtracting an offset to or from the base register value. The offset can be either an immediate or the value of an index register, and register-based offsets can also be scaled with shift operations. For the word and unsigned byte instructions, the immediate offset is a 12-bit number; for the halfword and signed byte instructions, it is an 8-bit number. Consequently, for the word and unsigned byte instructions (or the halfword and signed byte instructions), if the absolute value of the offset of the memory address from the base address is greater than 4095 (or 255), an immediate offset cannot be used and another instruction is needed to store the offset into a register; as a result, more than one instruction is needed to load or store the value of a single register from or to memory.

2.3 APCS on Energy Consumption

The APCS (ARM Procedure Call Standard) is a set of rules which regulate and facilitate calls between separately compiled or assembled program fragments [28]. It defines constraints on the use of registers, stack conventions, the format of a stack-based data structure, the passing of machine-level arguments and the return of machine-level results at externally visible procedure calls, and support for the ARM shared library mechanism.

In this section, we discuss the rules in the APCS that may affect energy consumption. The ARM has fifteen visible general registers, a program counter register, and eight floating-point registers. Table 2-2 describes the roles of the general and program counter registers in the APCS. The APCS defines that each contiguous chunk of the stack shall be allocated to activation records in descending address order. At all instants of execution, sp shall point to the lowest used address of the most recently allocated activation record, and the values of sl, fp, and sp shall be multiples of four.

It should be noted that the mapping from language-level data types and arguments to APCS words is defined by each language implementation, not by the APCS. Because our research on transformations focuses on the C language, the C calling conventions in the APCS are discussed. In an argument list, char, short, pointer, and other integral values each occupy one word; char and short values are widened by the C compiler during argument marshalling.

Argument values are marshalled in the order written in the source code. The first four of the remaining argument words are loaded into a1-a4, and the rest are pushed onto the stack in reverse order. A structure is called integer-like if its size is less than or equal to one word and the offset of each of its addressable sub-fields is zero. An integer-like structured result is returned in a1.

The APCS is now obsolete; the AAPCS (Procedure Call Standard for the ARM Architecture) should be noted instead. The AAPCS embodies the fifth major revision of the APCS and the third major revision of the TPCS (Thumb Procedure Call Standard). It forms part of the complete ABI (Application Binary Interface) specification for the ARM architecture [29].

Table 2-2 General and program counter registers [28]

Register  Name   APCS Role
r0        a1     argument 1 / integer result / scratch register
r1        a2     argument 2 / scratch register
r9        sb/v6  static base / register variable
r10       sl/v7  stack limit / stack chunk handle / register variable
r11       fp     frame pointer
r12       ip     scratch register / new-sb in inter-link-unit calls
r13       sp     lower end of current stack frame
r14       lr     link address / scratch register
r15       pc     program counter

2.4 Compiler Options that Control Optimization

In this thesis, gcc 2.95.3 is used as our cross compiler to compile our C source code. Because the compiler's optimization level and options remarkably affect energy consumption, we first need to decide which optimization level and options to use. The optimization levels of gcc 2.95.3 are shown in Table 2-3.

In embedded systems, three optimization levels are used frequently: -O0, -O2, and -Os. We use -O0 as our optimization level, both for stability considerations in embedded systems and to allow a clear analysis of the impact of transformations. In addition, we use the -fomit-frame-pointer option to avoid keeping the frame pointer in a register for procedures that do not need one, and to avoid the instructions that save, set up, and restore frame pointers; as a result, an extra register becomes available. Note that this option makes debugging impossible on ARM.
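For reference, the build setup described above could be expressed as a Makefile fragment like the following; the cross-compiler name arm-linux-gcc and the source file name are assumptions for illustration, not taken from the thesis.

```make
# Compile with gcc 2.95.3 as a cross compiler: no optimization (-O0),
# but omit frame pointers to free a register (-fomit-frame-pointer).
CC     = arm-linux-gcc            # assumed cross-compiler name
CFLAGS = -O0 -fomit-frame-pointer

bench: bench.c
	$(CC) $(CFLAGS) -o bench bench.c
```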

Table 2-3 The optimization levels of gcc 2.95.3 [30]

Optimization Level Description

-O0 (default) This is the default: do not optimize. At this level, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results. The compiler only allocates variables declared register in registers.

-O1 (-O) Optimize. The compiler tries to reduce code size and execution time.

-O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed trade-off. It turns on all optional optimizations except for loop unrolling, function inlining, and strict aliasing optimizations, and it also turns on additional options on all machines.

-O3 Optimize yet more. It turns on all optimizations specified by '-O2' and also turns on the '-finline-functions' option.

-Os Optimize for size.

2.5 Evaluation of Energy Consumption

In order to analyze the impact of transformations on energy consumption, we need to find a way to evaluate software energy consumption. Tan et al. [31] presented an energy simulation framework that can be used to analyze the energy consumption characteristics of an embedded system featuring the embedded Linux OS running on the StrongARM processor.

As shown in Figure 2-2, the simulator includes the following components:

1) a model for the StrongARM SA-1100 core, consisting of an instruction set simulator (ISS), simulation models for the instruction cache and data cache and a memory management unit (MMU);

2) a simulation model for 32 MB of system memory;

3) a simulation model for an interrupt controller;

4) simulation models for two timers;

5) simulation models for two UARTs conforming to the Intel 8250 series.

Figure 2-2 Modeled embedded system in EMSIM (Tan et al. [32])

The simulation models are shown on the right half of Figure 2-3, and the sequence of steps involved in using the simulation framework is shown on the left. The energy accounting mechanism of EMSIM is task-based, and a function energy stack for each task is used to evaluate the energy consumption of every function in the task. From the energy profiling report, we can obtain the number of invocations, the CPU cycles consumed, and the energy consumption of every function.

The SA-1100 microprocessor is a general-purpose, 32-bit RISC microprocessor with a 16 Kbyte instruction cache, an 8 Kbyte write-back data cache, a minicache, a write buffer, a read buffer, and a memory management unit (MMU) combined in a single chip [33]. It is software-compatible with the ARM V4 architecture processor family. In EMSIM, the 8 Kbyte write-back data cache is replaced by a 16 Kbyte one, and the minicache is not simulated. As shown in Figure 2-4, the size of a cache line (block) is 32 bytes, the caches are 32-way set-associative, and the replacement policy is round robin within a set.

Figure 2-3 Energy analysis framework of EMSIM (Tan et al. [32])


Figure 2-4 The 32-way set-associative cache in EMSIM

2.6 SUIF2 Compiler System

The SUIF (Stanford University Intermediate Format) system [34] was developed by the Stanford Compiler Group. It is a free compiler infrastructure designed to support collaborative research in optimizing and parallelizing compilers, based upon a program representation called SUIF. It maximizes code reuse by providing useful abstractions and frameworks for developing new compiler passes and by providing an environment that allows compiler passes to inter-operate easily. The SUIF group has now shifted its effort from SUIF1 to SUIF2.

The system also provides some useful tools, such as front ends, converters between SUIF1 and SUIF2, and converters from SUIF2 back to C. Hence, we can write our own SUIF compiler to operate on SUIF intermediate representations (IRs).

Figure 2-5 shows the SUIF system architecture. The components of the architecture are described as follows.

1) Kernel provides all basic functionality of the SUIF system.

2) Modules can be one of two kinds: a set of nodes in the intermediate representation, or a program pass.

3) Suifdriver provides execution control over modules.


Figure 2-5 The SUIF system architecture (Aigner et al. [35])

Passes are the main part of a SUIF compiler. A pass typically performs a single analysis or transformation and then writes the results out to a file. To create a compiler or a standalone pass, the user writes a "main" program that creates the SuifEnv, imports the relevant modules, loads a SUIF program, applies a series of transformations to the program, and eventually writes out the information, as shown in Figure 2-6.

Some passes which perform transformations were implemented and released in [36], [37]. Thanks to their release, we can use these passes to perform some of the transformations in our experiments.


Figure 2-6 A typical SUIF compiler (Aigner et al. [35])

Chapter 3 Transformations

A series of transformations operating at the source code level were presented in [13]-[25]. In the first section, we explain how we classify transformations, followed by the details of the transformations in the different categories in Sections 3.2-3.6. Finally, a summary of the transformations is given in the last section.

3.1 Classification / Category

Depending on whether transformations operate on data or code segments, we first divide them into two main categories: data and code transformations. In [16], code transformations were grouped into three sub-categories according to the code structures they operate on: loop transformations, procedural transformations, and control structures and operators transformations. However, that classification does not consider the influence of the ISA on energy consumption.

Table 3-1 Sub-categories of code transformations

Sub-category of code transformations              Description
Loop transformations                              Modify either the body or control structure of the loop
Control structures and operators transformations  Change either specific control structures or operators
Procedural transformations                        Modify the interface, declaration or body of procedures
ISA-specific transformations                      Transformations that depend on the target ISA

In our research, we find that some code transformations are strongly tied to the ISA of the target machine. Therefore, our code transformations include four sub-categories: loop, control structures and operators, procedural, and ISA-specific transformations. The sub-categories of code transformations are described in Table 3-1.

3.2 Data Transformations

In this section, we present a series of transformations that modify the data segment of source code. These transformations may result in reduced data cache misses and memory accesses, and energy consumption savings are therefore expected.

3.2.1 Scratch-pad Array Introduction

A smaller array is allocated to store the most frequently accessed elements of a larger array [16]. Spatial locality is expected to improve, contributing to reduced data cache misses. Note, however, that the additional instructions used to refresh the elements of the arrays may reduce performance and increase code size (i.e., instruction cache misses). As a result, this transformation may not reduce energy consumption.

3.2.2 Local Copy of Global Variables

In a procedure which operates on global variables, we can declare local variables and assign the values of the global variables to them on procedure entry [15]. We then refresh the global variables before leaving the procedure. In this way, the transformation increases the possibility for the compiler to store the variables in registers instead of memory (i.e., it reduces data cache misses). However, it has the same side effects as the previous transformation, so it may not reduce energy consumption.

3.2.3 Common Sub-expression Elimination

An occurrence of an expression in a program is a common sub-expression if there is another occurrence of the expression whose evaluation always precedes this one in execution order, and if the operands of the expression remain unchanged between the two evaluations [13]. Common sub-expression elimination is a transformation which stores the computed result of a common sub-expression in a variable and substitutes that variable for the later occurrences of the sub-expression.

Note that this transformation may not always be valuable, because recomputing the value may consume less energy than allocating another register (or memory location) to hold it. As a result, it does not always reduce energy consumption.

3.2.4 Miscellany

In [15], [16], a number of further data transformations were presented. Scalarization of array elements introduces temporary variables as substitutes for the most frequently accessed elements of an array. Multiple indirection elimination finds common chains of indirections and uses a temporary variable to store the address. Research on source code transformations is still ongoing.

3.3 Loop Transformations

Loop transformations operate on the statements which comprise a loop (i.e., these transformations modify either the body or the control structure of the loop). Because a large percentage of the execution time of programs is spent in loops, these transformations can have a very remarkable impact on energy consumption. Hence, a number of studies and approaches are based on loop transformations because of the importance of loops.

