
Chapter 2 Related Work

2.6 SUIF2 Compiler System

The SUIF (Stanford University Intermediate Format) system [34] was developed by the Stanford Compiler Group. It is a free compiler infrastructure designed to support collaborative research in optimizing and parallelizing compilers, built around the SUIF program representation. It maximizes code reuse by providing useful abstractions and frameworks for developing new compiler passes and by providing an environment that allows compiler passes to inter-operate easily. The SUIF group has since moved its effort from SUIF1 to SUIF2.

The system also provides several useful tools, such as front ends, converters between SUIF1 and SUIF2, and converters from SUIF2 back to C. Hence, we can write our own SUIF compiler passes that operate on the SUIF intermediate representation (IR).

Figure 2-5 shows the SUIF system architecture. The components of the architecture are described as follows.

1) The kernel provides all the basic functionality of the SUIF system.

2) A module can be one of two kinds: a set of nodes in the intermediate representation or a program pass.

3) Suifdriver provides execution control over modules.


Figure 2-5 The SUIF system architecture (Aigner et al. [35])

Passes are the main part of a SUIF compiler. A pass typically performs a single analysis or transformation and then writes the results out to a file. To create a compiler or a standalone pass, the user writes a “main” program that creates the SuifEnv, imports the relevant modules, loads a SUIF program, applies a series of transformations to the program, and eventually writes out the information, as shown in Figure 2-6.

Some passes that perform transformations were implemented and released in [36], [37]. Thanks to their release, we can use these passes to perform some of the transformations needed for our experiments.

initialize & load SUIF environment → passes (easy to reorder) → save & delete SUIF environment

Figure 2-6 A typical SUIF compiler (Aigner et al. [35])

Chapter 3 Transformations

A series of transformations operating at the source-code level was presented in [13]-[25]. In the first section, we explain how we classify the transformations; Sections 3.2-3.6 then detail the transformations in the different categories. Finally, a summary of the transformations is given in the last section.

3.1 Classification / Category

According to whether transformations operate on the data or code segments, we first divide transformations into two main categories: data transformations and code transformations. In [16], code transformations were grouped into three sub-categories according to the code structures they operate on: loop, procedural, and control structures and operators transformations. However, that classification does not consider the influence of the ISA on energy consumption.

Table 3-1 Sub-categories of code transformations

Sub-category of code transformations                Description
Loop transformations                                Modify either the body or the control structure of the loop
Control structures and operators transformations    Change either specific control structures or operators
Procedural transformations                          Modify the interface, declaration or body of procedures
ISA-specific transformations                        Transformations that are affected by the ISA of the target machine

In our research, we find that some code transformations are strongly tied to the ISA of the target machine. Therefore, our code transformations include four sub-categories: loop, control structures and operators, procedural, and ISA-specific transformations. The sub-categories of code transformations are described in Table 3-1.

3.2 Data Transformations

In this section, we present a series of transformations used to modify the data segment of the source code. These transformations may reduce data cache misses, memory accesses, etc., so energy savings are expected.

3.2.1 Scratch-pad Array Introduction

A smaller array is allocated to store the most frequently accessed elements of a larger array [16]. Spatial locality is expected to improve, contributing to fewer data cache misses. Note that the extra instructions used to refresh the elements of the arrays may reduce performance and increase code size (i.e., instruction cache misses). As a result, this transformation may not reduce energy consumption.
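A minimal C sketch of the idea follows; the array names, sizes, and access pattern are our own illustrative assumptions, not taken from [16].

    /* Hypothetical sketch of scratch-pad array introduction: a small array
     * caches the most frequently accessed elements of a larger one. */
    #define N   1024
    #define HOT 8

    int table[N];
    int scratch[HOT];                      /* smaller array for the hot elements */

    void compute(void)
    {
        for (int i = 0; i < HOT; i++)      /* load the hot elements */
            scratch[i] = table[i];

        for (int iter = 0; iter < 100000; iter++)
            scratch[iter % HOT] += iter;   /* operate on the small array only */

        for (int i = 0; i < HOT; i++)      /* refresh the original array */
            table[i] = scratch[i];
    }

The two copy loops correspond to the extra refresh instructions mentioned above.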

3.2.2 Local Copy of Global Variable

In a procedure that operates on global variables, we can declare local variables and copy the values of the global variables into them when the procedure is invoked [15]. We then refresh the global variables when leaving the procedure. In this way, the transformation increases the chance for the compiler to keep the variables in registers instead of memory (i.e., it reduces data cache misses). However, it has the same side effects as the above transformation, so it may not reduce energy consumption.
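A minimal sketch, assuming a single global counter; the names are illustrative only.

    int g_count;                            /* global variable */

    void accumulate(const int *v, int n)
    {
        int local = g_count;                /* local copy of the global */
        for (int i = 0; i < n; i++)
            local += v[i];                  /* 'local' can be kept in a register */
        g_count = local;                    /* refresh the global when leaving */
    }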

3.2.3 Common Sub-expression Elimination

An occurrence of an expression in a program is a common sub-expression if there is another occurrence of the expression whose evaluation always precedes this one in execution order and if the operands of the expression remain unchanged between the two evaluations [13]. Common sub-expression elimination stores the result of a common sub-expression in a variable and uses that variable to replace the other occurrences of the sub-expression.
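A minimal hand-applied example; the expressions and function names are illustrative only.

    /* before: (a + b) is evaluated twice */
    int f_before(int a, int b, int c, int d)
    {
        return (a + b) * c + (a + b) * d;
    }

    /* after: the common sub-expression is computed once and reused */
    int f_after(int a, int b, int c, int d)
    {
        int t = a + b;                      /* variable holding the common value */
        return t * c + t * d;
    }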

Note that this transformation is not always worthwhile, because it may cost less energy to recompute the value than to allocate another register (or memory location) to hold it. As a result, it does not always reduce energy consumption.

3.2.4 Miscellany

In [15], [16], a number of further data transformations are presented. Scalarization of array elements introduces temporary variables as substitutes for the most frequently accessed elements of an array. Multiple indirection elimination finds common chains of indirections and uses a temporary variable to store the address. Research on such transformations is still ongoing.
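As an illustration of scalarization, a hedged sketch (the function name and access pattern are assumptions):

    void smooth(int *a, int n)
    {
        int a0 = a[0];                      /* temporary substituting the hot element a[0] */
        for (int i = 1; i < n; i++)
            a0 += a[i] / n;
        a[0] = a0;                          /* write the element back once */
    }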

3.3 Loop Transformations

Loop transformations operate on the statements that comprise a loop (i.e., they modify either the body or the control structure of the loop). Because a large percentage of the execution time of programs is spent in loops, these transformations can have a very remarkable impact on energy consumption. Hence, many research efforts and approaches are based on loop transformations.

3.3.1 Loop Fusion

This transformation combines two or more loops with the same bounds into a single loop [21]. It reduces loop overhead; as a result, the number of instructions executed is reduced.

Besides, it can also improve data cache locality by bringing the statements that access the same set of data into the same loop [20]. However, if the enlarged loop body becomes larger than the instruction cache, instruction cache misses will increase; as a result, energy consumption will increase.
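A minimal C sketch of loop fusion, with assumed array names:

    /* before:
     *   for (int i = 0; i < n; i++) a[i] = b[i] + 1;
     *   for (int i = 0; i < n; i++) c[i] = a[i] * 2;
     */
    void fused(int *a, const int *b, int *c, int n)
    {
        for (int i = 0; i < n; i++) {       /* after: one loop, one set of loop overhead */
            a[i] = b[i] + 1;
            c[i] = a[i] * 2;                /* a[i] is reused while still in the cache */
        }
    }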

3.3.2 Loop Fission

Loop fission performs the opposite operation to loop fusion [21]. The goal of this transformation is to break a larger loop body into smaller ones so that each body fits into the instruction cache, thereby reducing instruction cache misses. Note that computation energy increases due to the new loop overheads. Therefore, we need to decide carefully whether loop fusion or loop fission should be applied to a loop in order to reduce energy consumption effectively.
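A minimal sketch of loop fission on assumed arrays:

    /* before: for (int i = 0; i < n; i++) { a[i] += 1; b[i] *= 2; } */
    void fissioned(int *a, int *b, int n)
    {
        for (int i = 0; i < n; i++)         /* each smaller body is more likely to fit */
            a[i] += 1;                      /* into the instruction cache              */
        for (int i = 0; i < n; i++)         /* the second loop adds its own overhead   */
            b[i] *= 2;
    }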

3.3.3 Loop Reversal

This transformation reverses the order in which a specific loop's iterations are performed [13]. In loops whose bodies contain dependences, loop reversal may eliminate the dependences; as a result, it allows other transformations to be applied. Besides, a special case of loop reversal that is useful on the ARM architecture was presented in [25]. Applying loop reversal turns an incrementing loop into a decrementing, count-down-to-zero loop. This allows the original ADD/CMP instruction pair to be replaced by a single SUBS instruction; because of this, it saves compares in critical loops, leading to reduced code size, increased performance, and reduced energy consumption.
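A minimal sketch of the count-down-to-zero form (the loop body is an assumption; the ARM instruction counts are as described above):

    /* before: for (int i = 0; i < n; i++) a[i] = 0;   (ADD/CMP pair on ARM) */
    void clear(int *a, int n)
    {
        for (int i = n - 1; i >= 0; i--)    /* after: a single SUBS can update and test i */
            a[i] = 0;
    }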

3.3.4 Loop Inversion

Loop inversion transforms a while loop into a repeat loop (i.e., it moves the loop conditional test from before the loop body to after it) [13]. As a result, only one branch instruction needs to be executed to leave the loop, rather than one to return to the beginning and another to leave after the conditional test at the beginning. Hence, energy consumption is expected to be reduced due to the reduced number of instructions executed. Note that this transformation is only safe when the loop body is executed at least once.
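A minimal sketch, with an explicit guard so the transformed loop stays safe even when the body might not execute (the loop body is an assumption):

    void fill(int *a, int n)
    {
        int i = 0;
        /* before: while (i < n) { a[i] = i; i++; } */
        if (i < n) {                        /* guard keeps the zero-iteration case safe */
            do {
                a[i] = i;
                i++;
            } while (i < n);                /* one backward branch per iteration */
        }
    }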

3.3.5 Loop Interchange

This transformation reverses the order of two adjacent loops in a loop nest to change the access paths to arrays [14]. It improves the chances that consecutive references fall in the same cache line, leading to reduced data cache misses. Hence, reduced energy consumption can be expected.
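A minimal sketch on a 2-D array; the array size is an assumption, and the layout argument relies on C's row-major storage:

    #define N 256
    void scale(int a[N][N])
    {
        /* before: the i and j loops are swapped, so accesses stride through memory */
        for (int i = 0; i < N; i++)         /* after: consecutive references a[i][j],  */
            for (int j = 0; j < N; j++)     /* a[i][j+1] fall in the same cache line   */
                a[i][j] *= 2;               /* (C arrays are stored row-major)         */
    }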

3.3.6 Loop Unrolling

Loop unrolling replaces the body of a loop with U (the unrolling factor) copies of the body and modifies the iteration step from 1 to U [13]. The original loop is called the rolled loop. Loop unrolling reduces loop overhead by executing fewer compare and branch instructions (i.e., better performance) and may improve the effectiveness of other transformations, such as common sub-expression elimination and software pipelining. It also allows the compiler to make better use of registers in the larger loop body.

On the other hand, the unrolled loop is larger than the rolled loop, so it increases code size and may impact the effectiveness of the instruction cache, leading to increased instruction cache misses. Therefore, deciding which loops to unroll and with what unrolling factors is very important.
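A minimal sketch with U = 4; for brevity we assume n is a multiple of 4, so no leftover iterations need handling:

    void add_one(int *a, int n)
    {
        for (int i = 0; i < n; i += 4) {    /* iteration step changed from 1 to U */
            a[i]     += 1;
            a[i + 1] += 1;
            a[i + 2] += 1;
            a[i + 3] += 1;
        }
    }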

Besides, there is another form of loop unrolling that applies to loops that are not counting loops [14]. It unrolls the loop and leaves the termination conditions in place. This technique is beneficial when dealing with a while loop to which later transformations can be applied.

3.3.7 Loop Unswitching

This transformation moves loop-invariant conditional branches outside of loops [13]. It reduces the number of instructions executed, because less code is executed in the loop body. If the conditional has only an if part, loop unswitching has little impact on code size. But if the conditional has else parts, the loop must be copied into every else part of the conditional. Hence, code size and instruction cache misses increase.
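A minimal sketch; the invariant test on flag and the loop bodies are assumptions:

    /* before: for (int i = 0; i < n; i++) { if (flag) a[i] = b[i]; else a[i] = 0; } */
    void unswitched(int *a, const int *b, int n, int flag)
    {
        if (flag) {                         /* loop-invariant test hoisted out of the loop */
            for (int i = 0; i < n; i++)
                a[i] = b[i];
        } else {
            for (int i = 0; i < n; i++)     /* the loop is copied into the else part */
                a[i] = 0;
        }
    }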

3.3.8 Miscellany

In addition to the above loop transformations, other energy-efficiency strategies on loop transformations can be envisioned.

Loop permutation is a generalization of loop interchange [13]. It allows more than two loops to be reordered to reduce data cache misses; as a result, it is expected to reduce energy consumption.

Loop tiling reduces the capacity and conflict misses that result from cache size limitations [21]. It improves cache performance by dividing the loop iteration space into smaller tiles. This also results in a logical division of arrays into tiles, which may increase reuse of array elements within each tile. With a correct selection of tile sizes, conflict misses, which occur when several data elements compete for the same cache line, can be eliminated. However, it also increases the number of instructions executed and the code size because of the additional nested loops.
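A minimal tiled matrix transpose; the array size N and tile size T are assumptions (T would normally be chosen from the cache parameters):

    #define N 512
    #define T 32                            /* tile size assumed to fit in the data cache */
    void transpose(int dst[N][N], const int src[N][N])
    {
        for (int ii = 0; ii < N; ii += T)           /* outer loops step over tiles */
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)   /* inner loops stay inside one tile, */
                    for (int j = jj; j < jj + T; j++)   /* so its elements are reused    */
                        dst[j][i] = src[i][j];
    }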

Software pipelining can improve the execution performance of loops [16]. It eliminates the dependences between adjacent statements by breaking the operations of a single loop iteration into S stages and arranging the code such that stage 1 is executed on the instructions originally belonging to iteration i, stage 2 on those of iteration i-1, etc. It improves pipeline performance by reducing pipeline stalls; hence, the number of CPU cycles consumed is expected to decrease. However, it may increase the number of instructions executed due to the calculation of iteration i. It also increases code size, because startup code is generated before the loop for initialization and cleanup code is generated after the loop to finish the remaining operations.
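A much simplified, hand-written sketch with S = 2 stages, where the load for iteration i+1 overlaps the use of the value from iteration i; the function and its arrays are assumptions:

    void copy_scaled(int *dst, const int *src, int n)
    {
        if (n <= 0) return;
        int next = src[0];                  /* startup code: prime stage 1 */
        for (int i = 0; i < n - 1; i++) {
            int cur = next;                 /* stage 2: value loaded in the previous iteration */
            next = src[i + 1];              /* stage 1: load for the next iteration */
            dst[i] = cur * 2;
        }
        dst[n - 1] = next * 2;              /* cleanup code after the loop */
    }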

3.4 Control Structures and Operators Transformations

In this section, we present a series of transformations that modify the control structures and operators of the source code to reduce energy consumption.

3.4.1 Conditional Sub-expression Reordering

It is possible to reorder the sub-expressions in a conditional to reduce energy consumption [16]. In OR conditions, we can place the sub-expressions most likely to be true in front of the others. In AND conditions, we can place the sub-expressions most likely to be false in front of the others. This transformation reduces the number of instructions executed and, as a result, reduces energy consumption. It has no side effects.
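A minimal sketch for an AND condition; we assume that (x < 0) is false far more often than the other test, and the function names are illustrative:

    int divisible_by_seven(int x) { return x % 7 == 0; }   /* relatively costly test */

    int keep(int x)
    {
        /* before: if (divisible_by_seven(x) && x < 0) ... */
        if (x < 0 && divisible_by_seven(x))  /* the cheap, likely-false test comes first, */
            return 1;                        /* so the second test is usually skipped     */
        return 0;
    }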

3.4.2 Special Cases Optimization

Brandolese et al. [16] presented the special cases optimization transformation, which replaces calls to generic library or user-defined functions with optimized ones. For example, someone may need to call mathematical functions whose arguments are floating-point variables but only wants to operate on integers; in that case, optimized integer versions can be rewritten to reduce energy consumption. This transformation is only a suggestion and cannot be implemented by an automated tool.

3.4.3 Special Cases Pre-evaluation

Some functions return a known value when a special value is passed for an argument. We can therefore avoid real calls to these functions by defining suitable macros that test for the special cases [16]. Hence, this may reduce real function calls (i.e., the number of instructions executed) but increases code size. Figure 3-1 shows some examples.

Figure 3-1 Some examples of macro definition for procedures (Brandolese et al. [16])
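In the spirit of the macro definitions referred to in Figure 3-1, a hedged example follows; it assumes pow() from <math.h> and relies only on the known special case x^0 = 1:

    #include <math.h>

    /* the macro tests for the special case and avoids the real call to pow() */
    #define POW(x, y) ((y) == 0.0 ? 1.0 : pow((x), (y)))

    double cube(double r)
    {
        return POW(r, 3.0);                 /* falls through to the real pow() call */
    }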

3.5 Procedural Transformations

Procedural transformations modify the interface, declaration, or body of procedures. There is also a considerable amount of research in this sub-category aimed at improving performance and saving energy.

3.5.1 Procedure Inlining

This transformation is supported by many compilers. It replaces an invocation of a procedure with the procedure body [16]; as a result, it increases spatial locality and decreases the number of procedure invocations. However, it increases code size, which may result in increased instruction cache misses.
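A minimal source-level sketch; the helper function and caller are assumptions:

    static int square(int x) { return x * x; }

    int sum_of_squares(const int *a, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i] * a[i];               /* body of square() substituted for the call */
        return s;
    }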

3.5.2 Procedure Integration

Procedure integration behaves almost the same as procedure inlining [13], but procedure inlining does not consider the call site. This transformation can differentiate among call sites that invoke the same procedure and decide which call sites should be integrated and which should simply invoke the original procedure. As a result, it may achieve a better trade-off between code size and energy consumption than procedure inlining.

3.5.3 Procedure Sorting

This transformation is the easiest instruction cache optimization to implement. It sorts the statically linked procedures according to the call graph and their frequency of use [13]. The transformation has two advantages. First, it places procedures near their callers in virtual memory so as to reduce paging traffic. Second, it places frequently used and related procedures together so that they are less likely to collide with each other in the instruction cache. To implement this transformation, we only need to reorder the procedure declarations.

3.5.4 Procedure Cloning

This transformation is based on procedural parameters that are constant at one or more call sites. For the call sites that call the same procedure and pass the same constant parameter values, we clone a copy of the procedure and rename it [13]. The new version of the procedure has fewer parameters, and in its body the constant parameters are replaced by constant values; as a result, it allows compilers to perform further optimization.
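A minimal sketch, assuming some call sites always pass the constant stride 1:

    void scale(int *a, int n, int step)     /* original procedure */
    {
        for (int i = 0; i < n; i += step)
            a[i] *= 2;
    }

    /* clone for call sites that always pass step == 1: the constant parameter is
     * removed and replaced by its value, enabling further optimization */
    void scale_step1(int *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] *= 2;
    }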

3.5.5 Loop Embedding

Loop embedding is an interprocedural transformation that moves a loop from outside a procedure into the body of the procedure [26]; as a result, it reduces the overhead of the procedure call. The original procedure needs to be preserved if it is called from more than one call site.
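A minimal sketch; the procedures and the matrix layout are assumptions:

    /* before (caller): for (int i = 0; i < rows; i++) clear_row(&m[i * n], n); */
    void clear_row(int *row, int n)         /* original procedure, preserved for other call sites */
    {
        for (int j = 0; j < n; j++) row[j] = 0;
    }

    void clear_rows(int *m, int rows, int n)    /* procedure with the caller's loop embedded */
    {
        for (int i = 0; i < rows; i++)          /* the call overhead is now paid once */
            for (int j = 0; j < n; j++)
                m[i * n + j] = 0;
    }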

3.5.6 Substitution of a Variable Passed as an Address with a Local Variable

This transformation replaces a procedural argument passed as an address with a local copy of the variable [15]. When optimization is enabled, compilers tend to allocate local variables in registers instead of memory. Hence, it reduces the number of memory accesses and the data cache misses. However, it adds code that assigns and restores the values, which increases the number of instructions executed; as a result, it may not reduce energy consumption.
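A minimal sketch; the procedure and its arguments are assumptions:

    void accumulate_sum(int *sum, const int *v, int n)
    {
        int local = *sum;                   /* local copy of the value passed by address */
        for (int i = 0; i < n; i++)
            local += v[i];                  /* 'local' can live in a register */
        *sum = local;                       /* restore the value through the pointer */
    }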

3.5.7 Miscellany

Other transformations that operate on procedures include soft inlining, which replaces calls and returns with jumps [16], and procedure splitting, which divides each procedure into a primary and a secondary component [13].

3.6 ISA-specific Transformations

Some transformations depend on the ISA they operate on. In this section, we present one existing transformation that is not independent of the ISA of the target machine. We also propose two transformations that are specific to the ARM ISA: dummy variables insertion and arrays declaration permutation. Note that the proposed transformations are also strongly tied to the compiler's strategy for calculating base addresses.

3.6.1 Arrays Declaration Sorting

This transformation modifies the order of local array declarations so that the most frequently accessed array is allocated at the top of the stack; in this way, the most frequently accessed memory locations can be reached with the direct access mode [15], which is less energy expensive. When using this transformation, we need to know the stack allocation strategy for local arrays implemented by the compiler.
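A hedged sketch of what the reordered declarations might look like; whether the first- or last-declared array ends up at the top of the stack depends entirely on the compiler, which is exactly the caveat above:

    void filter(void)
    {
        int rare[256];                      /* rarely accessed array */
        int hot[16];                        /* most frequently accessed array, declared so that
                                               (under the assumed layout) it sits at the top of
                                               the stack and can use the direct access mode */
        for (int i = 0; i < 16; i++)
            hot[i] = i;
        rare[0] = hot[0];
    }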

3.6.2 Dummy Variables Insertion

The proposed transformation is based on a feature of the ARM ISA. When elements of arrays are accessed, the base addresses of the arrays must be calculated first. This transformation tries to reduce the number of instructions executed to calculate the base addresses by inserting dummy variables, declared as volatile, between the arrays. In this way, the offsets of the array base addresses from the stack are changed so that a single instruction may suffice to obtain the base address of an array. Because the order of the array declarations is changed and the size of the stack allocation is increased, data cache misses and page faults might increase slightly. Code size and energy consumption are expected to be reduced because of the reduced number of instructions executed. Note that we need to check whether stack overflow will happen after the dummy variables are inserted.
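For illustration only, the transformed source might look as follows; the array sizes and the amount of padding are made-up values, since the real padding is computed by the DUMMY-VARIABLE-SIZE procedure and depends on the compiler's stack layout and on which offsets fit the immediate field of a single ARM 'add' instruction:

    void process(void)
    {
        int a[300];
        volatile int pad[1];                /* dummy variable: shifts the offset of 'b' so that
                                               (under the assumed layout) its base address can be
                                               formed with one 'add' immediate */
        int b[300];

        a[0] = 1;
        b[0] = 2;
        pad[0] = 0;                         /* volatile keeps the compiler from removing it */
    }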

In this thesis, we follow the pseudocode conventions in [38] to write our algorithms. We design an algorithm for the dummy variables insertion transformation under the following assumptions.

1) The offset is equal to or less than 2^26 (so that the procedure DUMMY-VARIABLE-SIZE(offset) operates correctly).

2) We suppose that compilers use only the ‘add’ instruction to calculate the base addresses of arrays, and that the immediate value of the instruction is never negative.

3) To simplify our algorithm, we also suppose that the initial offset passed to the procedure DUMMY-VARIABLES-INSERT(L, init_offset) is equal to or less than 1024 (because the initial offset is a multiple of 4, we do not need to insert a dummy variable in this situation). This is large enough in general cases.

The list L passed to the procedure DUMMY-VARIABLES-INSERT(L, init_offset) stores the attributes of the local array variables in the reverse order of declaration (i.e., the first element of list L stores the attributes of the rightmost array variable, and the last element stores those of the leftmost one). As shown in Figure 3-2, each element of the linked list L is an object with a string field var_name, two integer fields sizeof_type and no_elements, and a pointer field next. Given an element x in the list, var_name stores the name of the array variable, sizeof_type stores the size of an element of that array, no_elements stores the number of elements stored in the array, and next[x] points to its successor in the linked list. Besides, an attribute head[L] points to the first element of the list and an attribute length[L] stores the number of elements in the list.
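A minimal C rendering of this data structure, mirroring the pseudocode attributes described above (the struct names themselves are our own):

    struct array_attr {
        char              *var_name;        /* name of the array variable */
        int                sizeof_type;     /* size of one element in bytes */
        int                no_elements;     /* number of elements stored in the array */
        struct array_attr *next;            /* next[x]: successor in the list */
    };

    struct array_list {
        struct array_attr *head;            /* head[L]: first element of the list */
        int                length;          /* length[L]: number of elements in the list */
    };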

Because compilers will allocate memory space on the top of stack, when the procedure

