Experimental Framework - An Experimental Analysis of Source-Code Energy Optimization for Embedded Processors

Chapter 4 Experiments

4.1 Experimental Framework

The EMSIM 2.0 energy simulator for the StrongARM processor is adopted as part of our experimental framework to obtain the energy information for the experiments.

In addition to energy consumption, we need other information to analyze and evaluate the side effects each time the simulation function runs. Because the main factors affecting energy consumption are the CPU cycles consumed and the numbers of instruction and data cache misses, we modify the EMSIM energy simulator to report this information. The addresses of the instruction and data cache misses are also output so that the miss counts can be verified. In addition, to obtain more accurate cache-miss information, we detect when the simulation function is about to be executed and flush the instruction and data caches before it runs. The execution results of EMSIM before and after the modification are shown in Figure 4-1.

The code size of each function is obtained with the arm-linux-objdump program of GNU Binutils, a collection of binary tools.

Figure 4-1 The execution results of EMSIM

Since the EMSIM simulation framework runs the Linux OS on a StrongARM simulator, several Linux- and StrongARM-related components are needed [31], including the Linux OS kernel and the ARM toolchain. Linux 2.4.18 and patch-2.4.18-rmk3 are used to build our Linux kernel. The components of the ARM toolchain that we build are listed in Table 4-1.

Table 4-1 The ARM toolchain

Component            Version
binutils             2.11
gcc                  2.95.3
glibc                2.2.3
glibc-linuxthreads   2.2.3

[Figure 4-2 diagram labels: Step one; original C source code file; the SUIF compiler (c2suif, SUIF system IR, suif2c); manual transformation; transformed C source code file; shell script "run_batch", executed to repeat the above process for every C source code file in the same directory; Linux kernel.]

Figure 4-2 The overall experimental framework

Our experiments for every transformation involve three steps: generating the C source code files in one directory, generating the output files that record the execution results of EMSIM and arm-linux-objdump, and executing the "Energy Report" program, which parses the output files to profile energy consumption and side effects. The overall experimental framework is shown in Figure 4-2. In step one, we first design an original C source code file and generate the transformed files in one of two ways: by running SUIF passes that perform the transformation, or by transforming the code manually. In step two, to obtain the simulation results for every source code file, we write a shell script named "run_batch" that repeats the process of generating the executable program, running the simulation, and writing the information to an output file. In the last step, the "Energy Report" program parses the output files, then calculates and shows the results in a GUI, as shown in Figure 4-3.

Figure 4-3 The execution result of the Energy Report program

Source-level transformations lead to very different results depending on a number of factors, including the specific structure of the code, the target architecture, and the parameters of the transformations [16]. The compiler used and the compiler options chosen also have a significant impact. Because the EMSIM energy simulator is adopted, the target architecture is fixed in our experimental framework; Table 4-2 describes it. In addition, we adopt gcc 2.95.3 as our cross compiler and compile the source code files of our experiments with optimization level -O0 and the option -fomit-frame-pointer. Hence, the experiments in this thesis are designed to observe how the results vary with the specific structure of the code and the parameters of the transformations.

Table 4-2 The target architecture of our experimental framework

Component            Description
Processor            StrongARM SA-1100 (CPU clock: 206 MHz)
Cache architecture   32-way set-associative instruction cache:
                       Cache size: 16 KB
                       Cache line size: 32 bytes
                       Replacement policy: round robin
                     32-way set-associative write-back data cache:
                       Cache size: 16 KB
                       Cache line size: 32 bytes
                       Replacement policy: round robin

4.2 Data Transformations

In this section, a single data transformation, common sub-expression elimination, is examined in an experiment designed to verify its energy efficiency.

4.2.1 Common Sub-expression Elimination

Figure 4-4 C source code of common sub-expression elimination for Exp#1.1

We use the freely released JuanCSE pass [36] to apply common sub-expression elimination to the original C source code. The C source code before and after the transformation is shown in Figure 4-4.

4.3 Loop Transformations

In this section, a number of experiments of loop transformations are designed to verify their energy-efficiency, including loop fusion, loop fission, loop reversal, loop inversion, loop interchange, loop unrolling and loop unswitching.

4.3.1 Loop Fusion

Figure 4-5 C source code of loop fusion for Exp#2.1

As shown in Figure 4-5, we design an experiment in which the simulation function has two loops. The combined code size of the two loop bodies is smaller than the size of the instruction cache, so loop fusion can reduce the loop overhead without increasing the number of instruction cache misses. Hence, the energy consumption is expected to be reduced after the transformation.
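As a minimal sketch of the idea (not the code of Figure 4-5), the following C fragment shows two loops with the same loop count and their fused form. The array names, the stored values, the bound N and all function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>

#define N 100  /* hypothetical loop count, not taken from Exp#2.1 */

/* Before fusion: two loops, each paying its own loop overhead
   (counter update, compare, branch) N times. */
void init_separate(int a[N], int b[N]) {
    int i;
    for (i = 0; i < N; i++) a[i] = 41;
    for (i = 0; i < N; i++) b[i] = 21;
}

/* After fusion: one loop executes both bodies, so the loop overhead
   is paid N times instead of 2*N times. */
void init_fused(int a[N], int b[N]) {
    int i;
    for (i = 0; i < N; i++) {
        a[i] = 41;
        b[i] = 21;
    }
}

/* Returns 1 if both versions leave the arrays in the same state. */
int fusion_preserves_result(void) {
    int a1[N], b1[N], a2[N], b2[N];
    int i;
    init_separate(a1, b1);
    init_fused(a2, b2);
    for (i = 0; i < N; i++) {
        if (a1[i] != a2[i] || b1[i] != b2[i]) return 0;
    }
    return 1;
}
```

The fused loop is only a win under the condition stated above: the combined body must still fit in the instruction cache.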

4.3.2 Loop Fission

In our experimental framework, the size of the instruction cache is 16 KB. As shown in Figure 4-6, we design a simple experiment in which the loop body in the simulation function of the original C source code is larger than 16 KB, so both compulsory misses and capacity misses occur when the loop is executed. We then apply loop fission to break the loop into two loops, each with a body half the size of the original. After the transformation, the number of instruction cache misses is expected to drop markedly, resulting in reduced energy consumption.
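The shape of the transformation can be sketched as follows. This is not the code of Figure 4-6: in the experiment each half of the body is several kilobytes of code, whereas here each half is reduced to a single hypothetical statement so that the structure is visible.

```c
#include <assert.h>

#define N 100  /* hypothetical loop count */

/* Before fission: one loop whose body holds two independent parts. */
void body_combined(int a[N], int b[N]) {
    int i;
    for (i = 0; i < N; i++) {
        a[i] = i * 2;    /* stands in for the first half of the body  */
        b[i] = i + 41;   /* stands in for the second half of the body */
    }
}

/* After fission: each part gets its own loop. This is legal here only
   because the two parts do not depend on each other. */
void body_split(int a[N], int b[N]) {
    int i;
    for (i = 0; i < N; i++) a[i] = i * 2;
    for (i = 0; i < N; i++) b[i] = i + 41;
}

/* Returns 1 if both versions compute the same arrays. */
int fission_preserves_result(void) {
    int a1[N], b1[N], a2[N], b2[N];
    int i;
    body_combined(a1, b1);
    body_split(a2, b2);
    for (i = 0; i < N; i++) {
        if (a1[i] != a2[i] || b1[i] != b2[i]) return 0;
    }
    return 1;
}
```

The split version executes the loop overhead twice, which is the price paid for letting each loop body fit in the instruction cache.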

Figure 4-6 C source code of loop fission for Exp#2.2

4.3.3 Loop Reversal

In Section 3.3.3, energy consumption was expected to be reduced because loop reversal lets the original ADD/CMP instruction pair be replaced by a single SUBS instruction on the ARM. We design a simple experiment to verify this hypothesis, as shown in Figure 4-7.
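A minimal sketch of loop reversal at the source level is given below; it is not the code of Figure 4-7, and the function names and the summation body are hypothetical. Whether the compiler really emits a flag-setting decrement for the reversed loop depends on the compiler and its options.

```c
#include <assert.h>

/* Counting up: each iteration increments i and then compares it
   with n, which is a separate add and compare at the machine level. */
int sum_up(const int *a, int n) {
    int i, s = 0;
    assert(n >= 0);
    for (i = 0; i < n; i++) s += a[i];
    return s;
}

/* Counting down to zero: the decrement itself can set the condition
   flags, so the separate compare may become unnecessary. */
int sum_down(const int *a, int n) {
    int i, s = 0;
    assert(n >= 0);
    for (i = n; i > 0; i--) s += a[i - 1];
    return s;
}
```

Reversal is only valid when the iterations can run in the opposite order, as is the case for this summation.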

Figure 4-7 C source code of loop reversal for Exp#2.3

4.3.4 Loop Inversion

As shown in Figure 4-8, Exp#2.4 is designed to verify the energy efficiency of loop inversion. Transforming a for loop into a do-while loop is expected to reduce the number of instructions executed, leading to reduced energy consumption.
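The following sketch illustrates the transformation on a hypothetical array-fill loop; it is not the code of Figure 4-8. The guard in the inverted version is an addition needed for the general case in which the loop count may be zero.

```c
#include <assert.h>

/* for loop: the condition is tested at the top before every iteration. */
int fill_for(int *a, int n) {
    int i;
    for (i = 0; i < n; i++) a[i] = 41;
    return i;  /* number of iterations executed */
}

/* Inverted form: one guard test, then a do-while loop that tests the
   condition at the bottom of the body. */
int fill_do_while(int *a, int n) {
    int i = 0;
    if (n > 0) {
        do {
            a[i] = 41;
            i++;
        } while (i < n);
    }
    return i;  /* number of iterations executed */
}
```

When the loop count is a compile-time constant greater than zero, the guard can be dropped and only the do-while loop remains.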

Figure 4-8 C source code of loop inversion for Exp#2.4

4.3.5 Loop Interchange

Original C source code:

void sim_func(void) {
    int i, j;
    char a[16][512];
    for (j=0;j<=511;j++)
        for (i=0;i<=15;i++)
            a[i][j]=41;
}

Transformed C source code:

void sim_func(void) {
    int i, j;
    char a[16][512];
    for (i=0;i<=15;i++)
        for (j=0;j<=511;j++)
            a[i][j]=41;
}

Figure 4-9 C source code of loop interchange for Exp#2.5.a

This transformation is very useful for reducing conflict misses. In our experiments, we focus on the case in which the array is smaller than the data cache of our target architecture. Figure 4-9 shows the first experiment of this transformation. When the loop of the original program of Exp#2.5.a is executed, the compulsory misses fall in the same set of the data cache, because the row size of the two-dimensional array is 512 bytes. Conflict misses do not occur, because the number of rows in this array is 16, which is smaller than the number of ways. Hence, the energy consumption is expected to be unchanged after the transformation.

In the second experiment, we use the same row size as in the first experiment, as shown in Figure 4-10, but the number of rows is greater than the number of ways. As a result, conflict misses occur once the 33rd row of the array is accessed in the original loop. Hence, loop interchange is useful here: it eliminates the conflict misses and thereby reduces energy consumption.
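The set-mapping argument can be checked with a little arithmetic. The sketch below uses the cache parameters of Table 4-2 and assumes the conventional mapping in which the set is selected by the address bits directly above the line offset, with the array starting on a cache-line boundary. The function names are hypothetical, and the value 64 used for the first array dimension in the usage note is only an example, since the array of Figure 4-10 is not reproduced here.

```c
#include <assert.h>

/* Data cache parameters as listed in Table 4-2. */
#define CACHE_SIZE (16 * 1024)
#define LINE_SIZE  32
#define NUM_WAYS   32
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))  /* 16 sets */

/* Assumed mapping: set = (byte offset / line size) mod number of sets. */
unsigned set_index(unsigned long byte_offset) {
    return (unsigned)((byte_offset / LINE_SIZE) % NUM_SETS);
}

/* Byte offset of a[i][j] in a char array with row_bytes bytes per row. */
unsigned long element_offset(unsigned i, unsigned j, unsigned row_bytes) {
    return (unsigned long)i * row_bytes + j;
}

/* For a fixed j, counts how many of the elements a[0][j] .. a[rows-1][j]
   map to the same cache set as a[0][j]. With 512-byte rows each such
   element lies in a different cache line. */
unsigned lines_in_same_set(unsigned rows, unsigned row_bytes, unsigned j) {
    unsigned i, count = 0;
    unsigned target;
    assert(row_bytes > 0);
    target = set_index(element_offset(0, j, row_bytes));
    for (i = 0; i < rows; i++) {
        if (set_index(element_offset(i, j, row_bytes)) == target) count++;
    }
    return count;
}
```

With 512-byte rows, `lines_in_same_set(16, 512, 0)` yields 16, which fits within the 32 ways, while a first dimension of 64 yields 64 lines competing for the 32 ways of one set, which is the situation in which conflict misses appear.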


Figure 4-10 C source code of loop interchange for Exp#2.5.b

4.3.6 Loop Unrolling

This transformation has one parameter, the unrolling factor. Because the unrolling factor has a significant impact on energy consumption, we write a generator program that produces 100 C source code files with unrolling factors from 1 to 100.

As shown in Figure 4-11, the original C source code on the left has an unrolling factor of 1, and the transformed C source code on the right has an unrolling factor of U (2 to 100). Note that when U is not a divisor of the loop count, a copy of the original loop must be placed below the unrolled loop to complete the remaining iterations; otherwise, the loop copy can be eliminated.
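The role of the loop copy can be sketched as follows for a hypothetical unrolling factor of 4 and a hypothetical array-fill body; this is not the generated code of Figure 4-11.

```c
#include <assert.h>

#define U 4  /* hypothetical unrolling factor */

/* Unrolled by U: the main loop handles U elements per iteration. The
   copy of the original loop below it completes the n % U iterations
   that are left over when U does not divide the loop count. */
void fill_unrolled(int *a, int n) {
    int i;
    assert(n >= 0);
    for (i = 0; i + U <= n; i += U) {
        a[i]     = 41;
        a[i + 1] = 41;
        a[i + 2] = 41;
        a[i + 3] = 41;
    }
    for (; i < n; i++) a[i] = 41;  /* loop copy for the remainder */
}

/* Number of iterations executed by the loop copy. */
int remainder_iterations(int n) {
    assert(n >= 0);
    return n % U;
}
```

When `remainder_iterations(n)` is zero, the second loop never runs and can be removed from the source, which also removes its contribution to code size.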


Figure 4-11 C source code of loop unrolling for Exp#2.6

4.3.7 Loop Unswitching

Figure 4-12 C source code of loop unswitching for Exp#2.7.a


Figure 4-13 C source code of loop unswitching for Exp#2.7.b

This transformation is expected to reduce energy consumption, but it may increase code size because the loop is duplicated for the else part of the conditional. We design two experiments to evaluate the impact on code size: one in which the conditional has only an if part and one in which it also has an else part, as shown in Figure 4-12 and Figure 4-13, respectively.
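A minimal sketch of the if/else case is shown below. It is not the code of Figure 4-13; the loop-invariant condition `flag`, the stored values and the function names are hypothetical.

```c
#include <assert.h>

/* Before unswitching: the loop-invariant condition is tested in
   every iteration. */
void fill_switched(int *a, int n, int flag) {
    int i;
    for (i = 0; i < n; i++) {
        if (flag) a[i] = 41;
        else      a[i] = 21;
    }
}

/* After unswitching: the condition is tested once, and the loop is
   duplicated, one copy per branch. The second copy is the source of
   the code size increase in the else case. */
void fill_unswitched(int *a, int n, int flag) {
    int i;
    if (flag) {
        for (i = 0; i < n; i++) a[i] = 41;
    } else {
        for (i = 0; i < n; i++) a[i] = 21;
    }
}
```

When the conditional has no else part and the loop body consists only of the conditional, the loop simply moves inside the if and no second copy is needed.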

4.4 Control Structures and Operators Transformations

In this section, conditional sub-expression reordering, one of the control structure and operator transformations, is discussed.

4.4.1 Conditional Sub-expression Reordering

The operating principle of this transformation is simple. It reorders the conditional sub-expressions according to their probability so that fewer instructions are executed. Hence, the energy consumption is expected to be reduced.
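The effect relies on the short-circuit evaluation of `&&` in C. The sketch below is not the code of Figure 4-14: the two predicates and the evaluation counter are hypothetical and exist only to make the number of evaluated sub-expressions observable.

```c
#include <assert.h>

int evaluations;  /* counts how many sub-expressions are evaluated */

/* Rarely true: holds for 1 value in 100. */
int is_multiple_of_100(int v) { evaluations++; return v % 100 == 0; }

/* Almost always true for the loop counter values used below. */
int is_non_negative(int v) { evaluations++; return v >= 0; }

/* Original order: the likely-true test comes first, so && nearly
   always has to evaluate the second test as well. */
int count_original(int n) {
    int i, hits = 0;
    for (i = 0; i < n; i++)
        if (is_non_negative(i) && is_multiple_of_100(i)) hits++;
    return hits;
}

/* Reordered: the rarely-true test comes first, so most iterations
   stop after a single evaluation. */
int count_reordered(int n) {
    int i, hits = 0;
    for (i = 0; i < n; i++)
        if (is_multiple_of_100(i) && is_non_negative(i)) hits++;
    return hits;
}

/* Runs one variant and returns the number of evaluations it needed. */
int evaluations_for(int (*variant)(int), int n) {
    evaluations = 0;
    (void)variant(n);
    return evaluations;
}
```

For 1000 iterations the original order evaluates 2000 sub-expressions and the reordered version 1010, with the same result. In real code the reordering is only safe when the sub-expressions have no side effects that depend on evaluation order.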


Figure 4-14 C source code of conditional sub-expression reordering for Exp#3.1

4.5 Procedural Transformations

In this section, experiments on the procedural transformations, namely procedure inlining, procedure integration and loop embedding, are designed to evaluate their energy efficiency.

4.5.1 Procedure Inlining

Original C source code / Transformed C source code (excerpt):

int x, y;

int sim_funcA(void) {
    int result;
    result=x*2+21;
    result=result+y*+41;
    result=result*result+2;
    result=result/2;

int suif_tmp0, suif_tmp1, suif_tmp2;

x=210;

Figure 4-15 C source code of procedure inlining for Exp#4.1

As shown in Figure 4-15, the simulation function in the original C source code has four call sites at which the function 'sim_funcA' is invoked. We use the freely released JuanInlining pass [36] to apply procedure inlining to the original C source code.

4.5.2 Procedure Integration

Figure 4-16 C source code of procedure integration for Exp#4.2

Procedure integration is a generalization of procedure inlining in which the call sites to be integrated can be chosen individually. As shown in Figure 4-16, Exp#4.2 uses the same original file as Exp#4.1, but only the call site inside the loop is selected for integration. Hence, this transformation is expected not only to reduce energy consumption effectively but also to keep the increase in code size within a reasonable range.
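The selective nature of procedure integration can be sketched as follows. This is not the code of Figure 4-16: the callee `scale`, its body and the two call sites are hypothetical.

```c
#include <assert.h>

/* Hypothetical callee. */
int scale(int v) { return v * 2 + 21; }

/* Original: one call site outside the loop and one inside it. */
int run_original(int n) {
    int i, total;
    total = scale(n);            /* call site 1: executed once    */
    for (i = 0; i < n; i++)
        total += scale(i);       /* call site 2: executed n times */
    return total;
}

/* Integrated: only the call site inside the loop is replaced by the
   callee body, so the call overhead is removed where it is paid most
   often while the code grows by a single copy of the body. */
int run_integrated(int n) {
    int i, total;
    total = scale(n);            /* call site 1 is left as a call */
    for (i = 0; i < n; i++)
        total += i * 2 + 21;     /* body of scale(i), integrated  */
    return total;
}
```

Inlining every call site, as in procedure inlining, would instead add one copy of the body per call site.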

4.5.3 Loop Embedding

This transformation is expected to reduce energy consumption. We design a simple experiment to verify its energy-efficiency.
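Assuming the usual meaning of loop embedding, in which a loop that surrounds a call is moved from the caller into the callee, the transformation can be sketched as follows. This is not the code of Figure 4-17; all names and the array-fill body are hypothetical.

```c
#include <assert.h>

/* Original callee: handles one element per call. */
void set_one(int *a, int i) { a[i] = 41; }

/* Original caller: the loop surrounds the call, so the call overhead
   is paid n times. */
void fill_by_calls(int *a, int n) {
    int i;
    for (i = 0; i < n; i++) set_one(a, i);
}

/* Callee after loop embedding: the loop now lives inside it. */
void set_all(int *a, int n) {
    int i;
    for (i = 0; i < n; i++) a[i] = 41;
}

/* Caller after loop embedding: a single call remains. */
void fill_embedded(int *a, int n) {
    set_all(a, n);
}
```

The expected saving comes from executing the call and return sequence once instead of once per iteration.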

Figure 4-17 C source code of loop embedding for Exp#4.3

4.6 ISA-specific Transformations

In this section, a number of experiments on ISA-specific transformations are designed to verify their energy efficiency on the ARM ISA.

4.6.1 Arrays Declaration Sorting

We design two experiments to observe the results on the ARM ISA. Figure 4-18 and Figure 4-19 show the original and transformed source code for the two experiments, respectively.

Original C source code:

void sim_func(void) {
    int i, a[305], b[210], c[110];
    for (i=0;i<305;i++) a[i]=41;
    for (i=0;i<210;i++) b[i]=41;
    for (i=0;i<110;i++) c[i]=41;
}

Transformed C source code:

void sim_func(void) {
    int i, b[210], c[110], a[305];
    for (i=0;i<305;i++) a[i]=41;
    for (i=0;i<210;i++) b[i]=41;
    for (i=0;i<110;i++) c[i]=41;
}

Figure 4-18 C source code of arrays declaration sorting for Exp#5.1.a

Figure 4-19 C source code of arrays declaration sorting for Exp#5.1.b

4.6.2 Dummy Variables Insertion

As shown in Figure 4-20, this transformation inserts dummy variables to reduce the number of instructions executed to calculate the base addresses of the arrays; as a result, it is expected to reduce energy consumption on the ARM ISA.

Figure 4-20 C source code of dummy variables insertion for Exp#5.2

4.6.3 Arrays Declaration Permutation

As shown in Figure 4-21, this transformation permutes the order of the array declarations to reduce the number of instructions executed to calculate the base addresses of the arrays; as a result, it is expected to reduce energy consumption on the ARM ISA.

Figure 4-21 C source code of arrays declaration permutation for Exp#5.3

Chapter 5 Results and Analyses

In this chapter, we list the results of the experiments designed in Chapter 4. From the results, we analyze the relationship between energy consumption and side effects such as code size and performance.

Table 5-1 The definition of notations

Notation   Definition
⌊x⌋        The greatest integer less than or equal to x
⌈x⌉        The least integer greater than or equal to x
U          Unrolling factor
LO         Loop overhead
LB         Loop body
ICM        Instruction cache miss
DCM        Data cache miss
N_LC       The loop count
N_DCM      The number of data cache misses
N_DCM'     The number of data cache misses after transformation
S_ICL      The size of the instruction cache line
S_LO       The loop overhead size
S_LB       The loop body size
E_LO       The energy consumption of the instructions executed for the loop overhead
E_LB       The energy consumption of the instructions executed for the loop body
E_cmp      The energy consumption of the compare operation before transformation
E_cmp'     The energy consumption of the compare operation after transformation
E_ICM      The energy consumption of memory access for every instruction cache miss
E_DCM      The energy consumption of memory access for every data cache miss
E_ori      The energy consumption of the affected code before transformation
E_aft      The energy consumption of the affected code after transformation
ΔE         The energy consumption savings after transformation (ΔE = E_aft - E_ori)

To simplify our analyses, we make assumptions for some transformations. Where necessary, we also point out the limitations that compilers and the ISA impose on a transformation. Where possible, we give energy equations that express the energy consumption savings of a transformation. In these equations, we consider only the three main contributors to energy consumption: the energy of the instructions executed, and the energy of the memory accesses caused by instruction cache misses and by data cache misses. Table 5-1 lists the notation used in the following sections.

5.1 Data Transformations

In this section, the experimental results of the data transformations are listed and analyzed.

5.1.1 Common Sub-expression Elimination

Table 5-2 The result of Exp#1.1

Parameter                  Original   Transformed   Change (%)
Code Size (bytes)          80         76            -5.00
Instruction Cache Misses   4          3             -25.00
Data Cache Misses          1          1             0.00
CPU Cycles                 308        240           -22.08
Energy Consumption (nJ)    747        613           -17.92

Table 5-3 The definition of notations used in Section 5.1.1

Notation    Definition
N_CSE       The number of common sub-expressions
E_CSE       The energy consumption of the instructions executed for the common sub-expression
E_LDR_CSE   The energy consumption of loading the computation result of the common sub-expression
E_STR_CSE   The energy consumption of storing the computation result of the common sub-expression

As shown in Table 5-2, the number of data cache misses is not affected in this experiment. In general, however, the transformation may increase the number of data cache misses, because it introduces variables to store the results of the common sub-expressions. The code size and the number of instruction cache misses are reduced because fewer instructions are needed once the recomputation is removed; as a result, the CPU cycles consumed and the energy consumption are reduced. Table 5-3 lists the notation used in this section.

The energy consumption savings can be expressed as:

$\Delta E \approx (1 - N_{CSE}) \times E_{CSE} + N_{CSE} \times E_{LDR\_CSE} + E_{STR\_CSE}$    Eq. (5.1)
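One way to read Eq. (5.1) is sketched below. It assumes that N_CSE counts the places where the common sub-expression is used, that the original code computes the sub-expression at each of them, and that the transformed code computes it once, stores the result once and loads it at each use. This decomposition is an interpretation that is consistent with Table 5-3 and with the definition ΔE = E_aft - E_ori; it is not taken from the original derivation.

```latex
\begin{align*}
E_{ori} &\approx N_{CSE} \times E_{CSE} \\
E_{aft} &\approx E_{CSE} + E_{STR\_CSE} + N_{CSE} \times E_{LDR\_CSE} \\
\Delta E &= E_{aft} - E_{ori}
  \approx (1 - N_{CSE}) \times E_{CSE} + N_{CSE} \times E_{LDR\_CSE} + E_{STR\_CSE}
\end{align*}
```

Under this reading, ΔE is negative, meaning that energy is saved, whenever the recomputations removed cost more than the added store and loads.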

5.2 Loop Transformations

In this section, the experimental results of the loop transformations are listed and analyzed.

5.2.1 Loop Fusion

Assumption: the size of the loop produced by fusing several loops is equal to or less than the size of the instruction cache.

Table 5-4 The result of Exp#2.1

Parameter                  Original   Transformed   Change (%)
Code Size (bytes)          176        124           -29.55
Instruction Cache Misses   7          5             -28.57
Data Cache Misses          103        103           0.00
CPU Cycles                 20241      14047         -30.60
Energy Consumption (nJ)    36288      25845         -28.78

As shown in Table 5-4, the number of data cache misses is not affected. The code size and the number of instruction cache misses are reduced because there are fewer loop-overhead instructions; as a result, the CPU cycles consumed and the energy consumption are reduced. Table 5-5 lists the notation used in this section. The energy consumption of the loops before and after transformation can be expressed as follows:

Because N_DCM' is equal to N_DCM, the energy consumption savings can be derived as:

(1 )

Table 5-5 The definition of notations used in Section 5.2.1

Notation   Definition
LB_i       The loop body of the i-th loop
N_l        The number of loops that have the same loop count
S_LBi      The loop body size of the i-th loop
E_LBi      The energy consumption of the instructions executed for the loop body of the i-th loop

5.2.2 Loop Fission

Assumption: the size of the loop before transformation is greater than the size of the instruction cache, and the size of every loop after transformation is equal to or less than the size of the instruction cache.

As shown in Table 5-6, the number of data cache misses is not affected. The code size increases because of the additional loop-overhead instructions. Notably, the number of instruction cache misses is reduced dramatically because there are far fewer capacity misses; as a result, the CPU cycles consumed and the energy consumption are reduced.

Table 5-6 The result of Exp#2.2

Parameter                  Original   Transformed   Change (%)
Code Size (bytes)          21100      21168         0.32
Instruction Cache Misses   270193     663           -99.75
Data Cache Misses          206        206           0.00
CPU Cycles                 8917588    4606762       -48.34
Energy Consumption (nJ)    20210333   7526457       -62.76

5.2.3 Loop Reversal

As shown in Table 5-7, the code size is almost unchanged after loop reversal, and the numbers of instruction and data cache misses are not affected. From the assembly code files, we find that the CPU cycles consumed are reduced because comparing with zero needs only one instruction, whereas comparing with another value may need more than one; as a result, the energy consumption may or may not be reduced. Note, however, that this transformation may remove dependences in the loop body and thereby enable other transformations. The energy consumption savings can be expressed as:

$\Delta E \approx E_{cmp}' - E_{cmp}$    Eq. (5.5)

Table 5-7 The result of Exp#2.3

Parameter                  Original   Transformed   Change (%)
Code Size (bytes)          116        112           -3.45
Instruction Cache Misses   5          5             0.00
Data Cache Misses          101        101           0.00
CPU Cycles                 13315      12514         -6.02
Energy Consumption (nJ)    24667      23379         -5.22

Limitation: whether energy consumption is reduced depends on the compiler and the ISA used.

5.2.4 Loop Inversion

As shown in Table 5-8, the code size and the number of instruction cache misses are almost unchanged after loop inversion. The number of data cache misses is not affected. The CPU cycles consumed are reduced because fewer compare and branch instructions are executed; as a result, the energy consumption is reduced.

Table 5-8 The result of Exp#2.4

Parameter                  Original   Transformed   Change (%)
Code Size (bytes)          80         76            -5.00
Instruction Cache Misses   4          3             -25.00
Data Cache Misses          27         27            0.00
CPU Cycles                 4711       4685          -0.55
Energy Consumption (nJ)    8599       8446          -1.77

5.2.5 Loop Interchange

As shown in Table 5-9 and Table 5-10, the code size and the number of instruction cache misses are not affected by loop interchange. In Exp#2.5.a, as expected, the number of data cache misses is not affected due to no changes in the
