Objective - Instruction Set Extension Considering Pipeline Timing

Chapter 1 Introduction

1.5 Objective

The objective of this work is to take pipestage timing into account during ISE exploration in order to reduce the extra area cost of ISEs.

Chapter 2 Related Work and Background

2.1 Related Work

In most prior work [3, 4, 5, 6, 7, 9, 13], Instruction Set Extension (ISE) generation consists of two phases: ISE exploration and ISE selection.

ISE exploration

The authors of [3] propose an algorithm, called the exact algorithm, that explores all possible ISE candidates and can therefore be regarded as yielding an optimal solution. The exact algorithm maps the ISE search space, such as a basic block, to a binary tree and then discards the portions of the tree that violate predefined constraints. Nevertheless, this algorithm is so computation-intensive that it can hardly handle larger search spaces. For example, it takes about one hour to process a search space of only 30 instructions. To reduce the computational complexity, [3, 4] propose heuristic algorithms derived from the genetic algorithm and the K-L algorithm, respectively.

The work in [5] examines the impact of different constraints, such as ISA format, hardware area and control flow, on ISE generation. These constraints limit the performance improvement achievable by ISEs. The ISA format limits the number of read and write ports of the register file. The control-flow constraint determines whether the search space may cross basic block boundaries. In order to satisfy real-time constraints, [6] identifies search spaces according to whether they lie on the worst-case execution path rather than according to execution time. This is because the most frequently executed basic block or instruction may not contribute to the worst-case execution path. In [13], the granularity of each vertex within the search space can vary from one instruction to multiple subroutine calls. The authors also claim that one search space can consist of multiple basic blocks in their proposed algorithm.

From a different viewpoint, [15] characterizes each basic block as a polynomial representation. First, a multiple-input single-output (MISO) algorithm extracts symbolic algebraic patterns from the search spaces and represents these patterns as polynomials that stand for ISE candidates. These ISE candidates are then mapped to the polynomial representations of program segments by symbolic algebraic manipulation.

ISE selection

The work in [7] formulates ISE selection as an area minimization problem, for which there is a large body of related research in the logic synthesis domain. [8] proposes another algorithm that uses a divide-and-conquer search technique to solve ISE selection. To synchronize the pipelines of the CPU core and the ASFU, [9] first adjusts the timing of the CPU core to match that of the ASFU if the execution time of the ASFU is longer than that of the CPU core. Then, different numbers of ASFU pipeline stages (one, two, three, and so on, until no further performance improvement is obtained) are evaluated, with the timing of the CPU core adjusted to the ASFU each time. Finally, the number of ASFU pipestages with the best performance improvement is chosen.

In addition, to reduce hardware cost, [14] adds a new stage, called ISE combination, between the ISE exploration and ISE selection stages to merge multiple similar ISE candidates.

2.2 Background: Ant Colony Optimization (ACO) Algorithm

Why the Ant Colony Optimization Algorithm?

To indicate which part of a DFG will become an ISE, the implementation of each node must be decided. Even if every node has only a single hardware implementation option, each node can still be implemented either in software or in hardware, so there are 2^N possible ISE patterns (legal or illegal), where N is the DFG size. When N is 100, which is a common case, the number of combinations is as large as 2^100. Obviously, this is an NP-hard problem. To obtain solutions efficiently, we consider evolutionary computation, which has proved effective for many existing NP-hard problems.

Many computation models belong to evolutionary computation, such as genetic algorithms and simulated annealing. We select among these models according to how difficult it is to map each of them to the problem, because an intuitive and easy mapping usually leads to a simple and effective algorithm design. Among them, Ant Colony Optimization is considered the easiest to map to the problem.

One of the concepts of ACO is selecting one path among several choices in order to find the shortest path. We regard the selection among the different implementation options of each node as an analogous problem. This is the main reason we prefer ACO to the other models. The only remaining problem is how the nodes “communicate” with each other, which the merit computation in our design takes into account.

Basic Idea of Ant Colony Optimization Algorithm

The Ant Colony Optimization algorithm is inspired by the behavior of ants finding paths from the colony to food, and it has been used extensively to solve many optimization problems. Initially, ants wander randomly and lay down pheromone on the paths they have passed through. The density of the pheromone determines the probability with which the next ant will choose each path. Since the pheromone evaporates over time, a shorter path is traversed more quickly and thus retains a higher pheromone density. After a period of time, i.e. several iterations, more and more ants choose the shortest path, so the pheromone density on this path keeps growing. Finally, almost every ant chooses the shortest path, and the pheromone on the other paths evaporates to nearly zero.

Figure 2.2.1 shows an example. Suppose there are 50 ants in the colony, and they set out to find food. There are two paths to the food, one twice as long as the other. At t = 1, there is no pheromone on either path, so the ants choose between the paths with equal probability.

Suppose 25 ants choose one path and 25 ants choose the other. Each ant leaves one unit of pheromone on the path it passes, but 5 units of pheromone evaporate after t = 1, so the path segments the ants have passed hold 25 - 5 = 20 units of pheromone. At t = 2, the ants set out again. After t = 2, we can see the pheromone on each path segment. From then on, ants will choose the right-hand path with higher probability than the left-hand path.
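The bookkeeping in this example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the rule that an ant picks a path with probability proportional to its pheromone level is a common simplification assumed here, not a formula stated in the text.

```python
def deposit_and_evaporate(pheromone, ants_passed, evaporation, unit=1.0):
    """Each passing ant adds `unit` of pheromone, then a fixed amount evaporates."""
    return max(0.0, pheromone + ants_passed * unit - evaporation)


def choice_probabilities(pheromones):
    """Assumed rule: choose each path in proportion to its pheromone level."""
    total = sum(pheromones)
    if total == 0:
        # No pheromone anywhere (t = 1 in the example): all paths equally likely.
        return [1.0 / len(pheromones)] * len(pheromones)
    return [p / total for p in pheromones]


# t = 1: no pheromone yet, so the 50 ants split evenly over the two paths.
print(choice_probabilities([0.0, 0.0]))    # [0.5, 0.5]
# 25 ants pass and 5 units evaporate: 25 - 5 = 20 units remain.
print(deposit_and_evaporate(0.0, 25, 5))   # 20.0
```

Once the two paths carry different amounts of pheromone, `choice_probabilities` favors the richer one, which is the positive feedback the example describes.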

Figure 2.2.1: An example of ant behavior. Panels: Before Start (t=0), Go (t=1), Evaporation (t=1), Go (t=2), Evaporation (t=2), After (t=2). Labels: Ant Colony (50 ants); D = Distance, P = Pheromone; P=25→20 at the evaporation step.

Chapter 3 ISE Exploration

In this paper, the purpose of ISE exploration is to find frequently executed instruction patterns as ISE candidates and to evaluate all implementation options of each operation in the ISE candidates so as to minimize the execution time with less silicon area. The inputs of the ISE exploration algorithm are basic blocks (BBs), and its outputs are ISE candidates together with their implementation options. The implementation options of an operation represent the ways in which it can be implemented, and they fall roughly into two categories: hardware and software.

The flow of ISE exploration is briefly described as follows. Each input BB is first transformed into a data flow graph (DFG), and an implementation option (IO) table, which lists all implementation options of an operation, is attached to each operation in the DFG. On this extended DFG, the ISE exploration algorithm is executed repeatedly until no more ISE candidates can be found. Note that the ISE exploration algorithm explores only one ISE candidate in each round, and a round usually consists of multiple iterations. In each iteration, the algorithm first chooses one implementation option for each operation according to a probability value (p), which is a function of the pheromone and merit values. The pheromone has the same meaning as in the ACO algorithm, i.e. it reflects how many times an implementation option has been chosen in previous iterations. The merit value represents the benefit of choosing an implementation option. After the choice is made, the pheromone value is updated. The algorithm then evaluates the implementation options of each operation in the DFG, i.e. calculates their merit values, according to which implementation options its neighboring operations chose in the previous iteration. This process is repeated until, for every operation in the DFG, the probability value (p) of one of its implementation options exceeds a predefined threshold value, P_END.

3.1 Implementation option

According to the profiling results, BBs with longer execution times are transformed into DFGs. A DFG is represented by a directed acyclic graph G(V,E), where V is a set of vertices and E is a set of directed edges. Each vertex v∈V represents an assembly instruction in the BB, hereafter called an “operation”. Each edge (u,v)∈E from operation u to operation v indicates that the execution of operation v requires the data produced by operation u.
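As a minimal sketch, such a DFG could be held in two adjacency maps, one for successors and one for predecessors. The class name, method names and the three-instruction basic block below are hypothetical and only illustrate the definition of G(V,E).

```python
from dataclasses import dataclass, field


@dataclass
class DFG:
    """Directed acyclic graph G(V, E): vertices are operations, edges are data dependences."""
    succs: dict = field(default_factory=dict)  # u -> set of v with (u, v) in E
    preds: dict = field(default_factory=dict)  # v -> set of u with (u, v) in E

    def add_operation(self, v):
        self.succs.setdefault(v, set())
        self.preds.setdefault(v, set())

    def add_edge(self, u, v):
        """Record that operation v consumes the data produced by operation u."""
        self.add_operation(u)
        self.add_operation(v)
        self.succs[u].add(v)
        self.preds[v].add(u)


# Hypothetical basic block: t1 = a + b; t2 = t1 << 2; t3 = t1 ^ t2
g = DFG()
g.add_edge("add", "shl")
g.add_edge("add", "xor")
g.add_edge("shl", "xor")
print(sorted(g.preds["xor"]))  # ['add', 'shl']
```

Keeping both directions makes it cheap to walk to neighboring operations on either side, which the grouping step of the exploration algorithm relies on.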

Each operation usually has multiple implementation options, which can be divided into two categories: hardware and software. A hardware implementation option means that the operation is included in an ISE and implemented in extra hardware, i.e. the ASFU.

Because of differing speed and area requirements, such an operation usually has at least one hardware implementation option. A software implementation option, on the other hand, means that the operation is executed in the CPU core, and its execution time is the execution cycle count defined for that operation in the CPU specification.

To represent all implementation options of a node, we add a table, called the implementation option (IO) table, to each operation. Each entry in the IO table consists of one implementation option of the operation together with its delay and area, which represent the execution cycles and the extra silicon area cost of this implementation option, respectively. Using the software implementation option for an operation requires at least one execution cycle but introduces no extra silicon cost. Using a hardware implementation option, on the other hand, can reduce the execution cycles but consumes extra silicon area. Adding the IO tables to G yields a new graph G⁺. Figure 3.1.1 shows an example of G⁺ that consists of two operations, A and B. In this example, we assume the delay of the software implementation option is one cycle.

Figure 3.1.1: An example of G⁺
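One possible in-memory form of the IO tables of such a G⁺ is sketched below. The type name, the helper function and all delay and area values are hypothetical placeholders (they are not the values of Figure 3.1.1); only the convention that entry 0 is the software option with a one-cycle delay and zero area follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImplOption:
    kind: str      # "software" or "hardware"
    delay: float   # execution cycles of this option
    area: float    # extra silicon area (0 for software)


# Entry 0 is the software option; entries 1.. are hardware options.
# All numbers below are made up for illustration.
io_table = {
    "A": [ImplOption("software", 1.0, 0.0),
          ImplOption("hardware", 0.4, 500.0)],
    "B": [ImplOption("software", 1.0, 0.0),
          ImplOption("hardware", 0.6, 300.0),
          ImplOption("hardware", 0.3, 800.0)],
}


def has_hardware_option(op):
    """True if the operation can be implemented in the ASFU at all."""
    return any(o.kind == "hardware" for o in io_table[op])


print(has_hardware_option("A"), len(io_table["B"]))  # True 3
```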

3.2 Formulation of ISE Exploration

ISE exploration explores ISE candidates in G⁺ and evaluates all implementation options of each operation in the ISE candidates. An ISE candidate in G⁺ is a subgraph S ⊆ G⁺. The proposed ISE exploration can be formulated as follows:

ISE exploration: Given a graph G⁺, find S⊆G⁺ and evaluate all implementation options of vertex v∈S to minimize the execution cycle count with less silicon area under the following constraints:

1. IN(S) ≤ Nin,

2. OUT(S) ≤ Nout,

3. S is convex,

4. Load and store operations ∉ S.

IN(S) (OUT(S)) represents the number of input (output) values used (produced) by S. The user-defined values Nin and Nout indicate the limits on the register file read and write ports, respectively. The convex constraint requires that an output of the ISE cannot feed back to one of its inputs via other operations not grouped in the ISE. In other words, S is convex if there exists no path from an operation u∈S to another operation v∈S that involves an operation w∉S. To conform to the limitations of the RISC architecture and to reduce the complexity of the algorithm, load and store operations are prohibited from being grouped into an ISE.

In fact, if the restriction imposed by the EX and MEM stages of a typical pipeline could be removed, so that execution and memory access could take place in an arbitrary order, load and store operations could be grouped into ISEs. It is reasonable to expect that this would enhance the benefit of ISEs.
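The four constraints of this formulation can be checked with a few set operations, as the following sketch suggests. The function names are hypothetical, and two modelling choices are assumptions made here: live-in values are represented as source vertices of the graph, and operations whose results are needed after the basic block are passed in as `live_out`.

```python
def in_count(preds, s):
    """IN(S): distinct values consumed by S but produced outside S."""
    return len({u for v in s for u in preds.get(v, ()) if u not in s})


def out_count(succs, s, live_out=frozenset()):
    """OUT(S): operations in S whose result is used outside S (or is live-out)."""
    return sum(1 for u in s
               if u in live_out or any(v not in s for v in succs.get(u, ())))


def is_convex(succs, s):
    """True if no path leaves S through an outside operation and re-enters S."""
    stack = [v for u in s for v in succs.get(u, ()) if v not in s]
    seen = set(stack)
    while stack:
        w = stack.pop()
        for v in succs.get(w, ()):
            if v in s:
                return False  # path u -> ... -> w -> v with w outside S
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return True


def is_legal(preds, succs, s, n_in, n_out, memory_ops=frozenset()):
    """Check constraints 1 to 4 for a candidate subgraph S."""
    return (in_count(preds, s) <= n_in and out_count(succs, s) <= n_out
            and is_convex(succs, s) and not (set(s) & set(memory_ops)))


# Hypothetical DFG: x and y are live-in values modelled as source vertices.
succs = {"x": {"a"}, "y": {"a"}, "a": {"b", "c"}, "b": {"c"}, "c": set()}
preds = {v: {u for u in succs if v in succs[u]} for v in succs}

print(in_count(preds, {"a", "b"}))    # 2 (x and y)
print(out_count(succs, {"a", "b"}))   # 2 (a and b both feed c)
print(is_convex(succs, {"a", "b"}))   # True
print(is_convex(succs, {"a", "c"}))   # False: a -> b -> c leaves and re-enters S
```

The convexity test only walks through operations outside S, so it stops as soon as one of them feeds a value back into S.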

3.3 ISE Exploration Algorithm

The ISE exploration algorithm is derived from the ACO algorithm. Conceptually, one entry in an IO table, i.e. one implementation option, represents one path, or part of a path, from the colony to the food in the ACO algorithm. Exploring an ISE candidate while evaluating different implementation options is just like an ant finding the shortest path from the colony to the food.

As in the ACO algorithm, which implementation option is chosen depends on its probability value (p). The probability value (p) of each implementation option of an operation is its probability of being chosen in each iteration of the ISE exploration algorithm. Choosing implementation options probabilistically also helps the algorithm avoid being trapped in locally optimal solutions. The probability value (px,j) of the j-th implementation option of operation x is a function of the pheromone and merit values, as shown in equation (3.1). The pheromone value has the same meaning as the pheromone in the ACO algorithm: it reflects how many times an implementation option has been chosen in previous iterations. We denote the pheromone value of the j-th implementation option of operation x by pheromonex,j, where pheromonex,0 is the pheromone value of the software implementation option. Just like the pheromone in ACO, the pheromone value must be updated after each iteration. The merit value is defined as the benefit of choosing an implementation option, and it is calculated by the merit function, which is described in detail later. The merit value of the j-th implementation option of operation x is denoted by meritx,j, where meritx,0 is the merit value of the software implementation option. The probability of the j-th implementation option of operation x being chosen (px,j) is computed by:

where k is the number of hardware implementation options in operation x, α determines the relative influence of pheromone and merit, and

$$\sum_{j=0}^{k} p_{x,j} = 1 \qquad (3.2)$$
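To make the role of α concrete, the sketch below computes the probabilities for one operation. The weighting used here, α·pheromone + (1 - α)·merit clipped at zero and normalized so that the values over all options of the operation sum to one, is an assumption chosen for illustration; it is not necessarily the exact form of the probability equation. The function name is hypothetical.

```python
def option_probabilities(pheromone, merit, alpha):
    """Probability of each implementation option of one operation (index 0 = software).

    Assumed form: weight_j = alpha * pheromone_j + (1 - alpha) * merit_j,
    clipped at zero, then normalized so the probabilities sum to 1.
    """
    weights = [max(0.0, alpha * ph + (1.0 - alpha) * me)
               for ph, me in zip(pheromone, merit)]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to a uniform choice.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


# One software option (merit 1) and one hardware option (merit 3), no pheromone yet.
p = option_probabilities(pheromone=[0.0, 0.0], merit=[1.0, 3.0], alpha=0.5)
print(p)  # [0.25, 0.75]
```

With α close to 1 the choice is driven mostly by how often an option was chosen before, and with α close to 0 mostly by its merit.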

Figure 3.3.1 shows the proposed ISE exploration algorithm. Here, we assume that there are m (m > 0) operations in a DFG and that each operation has n (n > 0) implementation options. In step 1, the algorithm sets the initial pheromone and merit values of each implementation option of all operations. Note that the initial merit values of the hardware and software implementation options differ, because we want the algorithm to have a higher chance of choosing hardware implementation options at the beginning of execution. In step 2, the algorithm checks whether each operation x (x=1 to m) has a hardware implementation option. If so, the algorithm chooses one of the implementation options of operation x according to their probability values (px,j); otherwise, it chooses the software implementation option.

In step 3, the ISE exploration algorithm updates the pheromone value of each implementation option j of operation x (x=1 to m) according to whether implementation option j was chosen. The pheromone value of the chosen implementation option is increased by ρ, a positive constant, and those of the others are decreased by ρ. In step 4, the algorithm calculates the merit value of each implementation option of operation x. As in step 2, the algorithm first checks whether operation x (x=1 to m) has a hardware implementation option. If so, the algorithm executes the Hardware-Grouping function, which determines whether operation x can be grouped with its neighboring operations into a virtual ISE candidate. If it can, Hardware-Grouping uses this virtual ISE candidate to calculate the execution time and silicon area of each hardware implementation option of operation x. We describe the Hardware-Grouping function in detail later. The merit value (meritx,j) of implementation option j (j=1 to n) of operation x is then generated by the merit function. Finally, the ISE exploration algorithm checks the end condition in step 5. If the end condition is not satisfied, the algorithm returns to step 2 and enters the next iteration; otherwise, it terminates.

1. (Initialization)
   For implementation option j (j=0 to n) of operation x (x=1 to m) in DFG
      pheromonex,j = 0;
      If (j=0)
         meritx,0 = initial value of software implementation option;
      Else
         meritx,j = initial value of hardware implementation option;

2. (Calculating probability value (p) and choosing implementation option)
   For operation x (x=1 to m)
      If (x has hardware implementation option)
         For implementation option j (j=0 to n) in operation x
            Calculate px,j;
         Choose one implementation option according to its probability value (p);
      Else
         Choose software implementation option;

3. (Pheromone update)
   For implementation option j (j=0 to n) of operation x (x=1 to m) in DFG
      If the implementation option is selected
         pheromonex,j = pheromonex,j + ρ;
      Else
         pheromonex,j = pheromonex,j - ρ;

4. (Calculating merit)
   For operation x (x=1 to m)
      If (x has hardware implementation option)
         For implementation option j (j=1 to n) in operation x
            Execute Hardware_Grouping;
            Calculate meritx,j;

5. (Terminating condition)
   If not (end_condition) go to step 2;

Figure 3.3.1: ISE Exploration Algorithm
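The five steps of Figure 3.3.1 can be sketched as a short Python loop. Several things here are assumptions made for illustration: the probability weighting in `_probabilities`, the default parameter values, the iteration cap, and the `merit_fn` callback, which stands in for Hardware-Grouping plus the merit function. All names are hypothetical.

```python
import random


def _probabilities(pheromone, merit, alpha):
    # Assumed weighting; the exact form of the probability equation may differ.
    w = [max(0.0, alpha * ph + (1.0 - alpha) * me) for ph, me in zip(pheromone, merit)]
    total = sum(w)
    return [v / total for v in w] if total > 0 else [1.0 / len(w)] * len(w)


def explore(n_options, merit_fn, alpha=0.5, rho=1.0, p_end=0.99,
            init_sw=1.0, init_hw=2.0, max_iters=10_000, seed=0):
    """n_options[x] = number of options of operation x, with index 0 = software."""
    rng = random.Random(seed)
    m = len(n_options)
    # Step 1: zero pheromone; hardware options start with a higher merit.
    pher = [[0.0] * n for n in n_options]
    merit = [[init_sw] + [init_hw] * (n - 1) for n in n_options]
    chosen = [0] * m
    for iteration in range(1, max_iters + 1):
        # Step 2: choose one option per operation according to its probability.
        for x in range(m):
            if n_options[x] > 1:
                p = _probabilities(pher[x], merit[x], alpha)
                chosen[x] = rng.choices(range(n_options[x]), weights=p)[0]
            else:
                chosen[x] = 0  # software only
        # Step 3: reinforce chosen options, weaken the others.
        for x in range(m):
            for j in range(n_options[x]):
                pher[x][j] += rho if j == chosen[x] else -rho
        # Step 4: recompute the merit of every hardware option.
        for x in range(m):
            for j in range(1, n_options[x]):
                merit[x][j] = merit_fn(x, j, chosen, merit[x][j])
        # Step 5: stop once every operation has an option above P_END.
        probs = [_probabilities(pher[x], merit[x], alpha) for x in range(m)]
        if all(max(p) > p_end for p in probs):
            break
    taken = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return taken, iteration


# Three operations with 1, 2 and 3 options; the merit is left unchanged here.
taken, iters = explore([1, 2, 3], merit_fn=lambda x, j, chosen, old: old)
print(taken, iters)
```

A real `merit_fn` would inspect the options chosen by neighboring operations, which is how the operations influence each other between iterations.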

The end condition is that, for every operation in the DFG, the probability value (p) of one of its implementation options exceeds P_END, a predefined threshold value very close to 100%. A larger P_END offers a greater chance of obtaining a better result, but it requires a longer convergence time, i.e. more computing time. An implementation option whose probability value (p) exceeds P_END is called a taken implementation option. A single ISE candidate is a group of connected operations in the DFG, all of which have a taken hardware implementation option.
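Reading ISE candidates off the converged result amounts to finding connected groups among the operations whose taken option is a hardware one. The sketch below does this with a simple graph traversal; the function name and the five-operation example are hypothetical, and connectivity is taken to ignore edge direction, which is an assumption.

```python
def ise_candidates(edges, taken):
    """edges: iterable of (u, v); taken: dict op -> taken option index (0 = software)."""
    hw = {op for op, j in taken.items() if j > 0}
    neighbours = {op: set() for op in hw}
    for u, v in edges:
        if u in hw and v in hw:
            neighbours[u].add(v)
            neighbours[v].add(u)
    candidates, seen = [], set()
    for start in sorted(hw):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            op = stack.pop()
            if op in group:
                continue
            group.add(op)
            stack.extend(neighbours[op] - group)
        seen |= group
        candidates.append(group)
    return candidates


# Chain 1 -> 2 -> 3 -> 4 -> 5 in which operation 3 stays in software.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
taken = {1: 1, 2: 1, 3: 0, 4: 2, 5: 1}
print(ise_candidates(edges, taken))  # [{1, 2}, {4, 5}]
```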

Hardware-Grouping

If operation x has a hardware implementation option, a function called Hardware-Grouping must be executed before the merit value of each hardware implementation option is computed. Hardware-Grouping checks whether operation x can be grouped with its neighboring operations into a virtual ISE candidate. It recursively groups operation x with the neighboring operations that chose a hardware implementation option in the previous iteration, forming a virtual ISE candidate, i.e. a virtual subgraph vSx. We denote the result of Hardware-Grouping for operation x using its j-th implementation option by vSx,j. Note that vSx,0 is meaningless because the 0-th implementation option is the software one. Hardware-Grouping then evaluates the execution time and silicon area of vSx,j: the execution time of vSx,j is the delay of the critical path in vSx,j, and the silicon area of vSx,j is the sum of the silicon areas of the operations in vSx,j.

We use Figure 3.3.2 to explain how Hardware-Grouping operates. The table in Figure 3.3.2 lists the delay and area of each implementation option of every operation and marks the implementation option chosen in the previous iteration. In both the top left and the bottom left of Figure 3.3.2, the nodes enclosed by dotted lines are treated as a virtual ISE candidate.

For operation #2, Hardware-Grouping groups operations #2 and #3 into a virtual ISE candidate, vS2, as shown in the top left of Figure 3.3.2. Since operation #2 has only one hardware implementation option, vS2 has a single evaluation result (execution time = 0.8, silicon area = 1200). The bottom left of Figure 3.3.2 shows another example, in which Hardware-Grouping groups operation #5 with its neighboring operations #2, #3, #6 and #7 into a virtual ISE candidate, vS5. Since operation #5 has two hardware implementation options, vS5 has two evaluation results: vS5,1 (execution time = 1.7, silicon area = 2400) and vS5,2 (execution time = 1.4, silicon area = 3000).
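A possible shape of Hardware-Grouping is sketched below: it grows the group from operation x over neighbors that chose a hardware option, then reports the critical-path delay and the summed area. The function name and data layout are hypothetical. The per-operation delays and areas in the example are invented so that the totals match the values quoted for vS2 (0.8 and 1200); they are not the entries of the table in Figure 3.3.2.

```python
def hardware_grouping(x, j, preds, succs, chosen, io_table):
    """Return (group, execution_time, area) of the virtual ISE candidate vS_{x,j}.

    io_table[op][k] = (delay, area) of the k-th option of op (k = 0 is software);
    chosen[op] = option index that op chose in the previous iteration.
    """
    option = dict(chosen)
    option[x] = j  # evaluate x under its j-th (hardware) option
    group, stack = set(), [x]
    while stack:
        op = stack.pop()
        if op in group:
            continue
        group.add(op)
        for nb in preds.get(op, set()) | succs.get(op, set()):
            if nb not in group and option.get(nb, 0) > 0:
                stack.append(nb)  # neighbor chose hardware: pull it into the group
    delay = {op: io_table[op][option[op]][0] for op in group}
    area = sum(io_table[op][option[op]][1] for op in group)
    finish = {}

    def finish_time(op):
        # Longest delay path inside the group that ends at op.
        if op not in finish:
            start = max((finish_time(p) for p in preds.get(op, set()) if p in group),
                        default=0.0)
            finish[op] = start + delay[op]
        return finish[op]

    exec_time = max(finish_time(op) for op in group)
    return group, exec_time, area


# Hypothetical chain 1 -> 2 -> 3; operation 1 chose software, 2 and 3 hardware.
succs = {1: {2}, 2: {3}, 3: set()}
preds = {1: set(), 2: {1}, 3: {2}}
io_table = {1: [(1, 0)], 2: [(1, 0), (0.5, 700)], 3: [(1, 0), (0.3, 500)]}
group, t, a = hardware_grouping(2, 1, preds, succs, {1: 0, 2: 1, 3: 1}, io_table)
print(sorted(group), round(t, 3), a)  # [2, 3] 0.8 1200
```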

Figure 3.3.2: Examples of Hardware-Grouping. Panels: Hardware grouping of operation #2; Hardware grouping of operation #5. Table columns: Operation, Option, Delay, Area; surviving row: 1, ● software, 1, 0.

Merit Function

The purpose of the merit function is to calculate the benefit, i.e. the merit value, of an implementation option. Briefly, the merit function consists of three cases: size checking (case 1), constraint violation determination (case 2) and benefit calculation (case 3). Figure 3.3.3 depicts the algorithm of the merit function. In case 1, the algorithm checks whether size(vSx,j), the number of operations in vSx,j, is equal to one. If so, vSx,j contains only operation x and cannot improve performance, so the algorithm adjusts the merit value to decrease the chance of choosing the hardware implementation option, which correspondingly raises the probability of choosing the software implementation option; the calculation of the merit function then terminates. Otherwise, the algorithm proceeds to case 2. Note that in this paper we assume every operation has a one-cycle delay; if multiple-cycle delays are assumed, case 1 may need to be tailored accordingly.

Case 2 checks whether vSx violates the input port, output port or convex constraints. If so, the merit value of each hardware implementation option is multiplied by a constant β1, β2 or β3 (0 < β1 < 1, 0 < β2 < 1 and 0 < β3 < 1). As in case 1, this reduces the chance of selecting a hardware implementation option and thus relatively raises that of the software implementation option. The calculation of the merit function then terminates. We only scale down the merit value of each hardware implementation option of operation x by a constant, rather than excluding operation x from becoming part of an ISE candidate, so that operation x still has an opportunity to be grouped into an ISE candidate in a later iteration. If no constraint is violated, the algorithm enters case 3.
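Cases 1 and 2 can be sketched as follows. Several details are assumptions made for illustration only: that β1, β2 and β3 correspond to the input, output and convex constraints respectively, that several violations compound, that case 1 scales the merit by a separate factor (`size_penalty`), and all the default values. The function name is hypothetical, and case 3 is left to the caller.

```python
def merit_cases_1_and_2(merit, size, violates_in, violates_out, violates_convex,
                        beta1=0.5, beta2=0.5, beta3=0.5, size_penalty=0.5):
    """Return (new_merit, done); done is False when case 3 must still be evaluated."""
    # Case 1: a group of one operation cannot speed anything up.
    if size == 1:
        return merit * size_penalty, True
    # Case 2: penalize, but do not forbid, options whose group breaks a constraint.
    if violates_in or violates_out or violates_convex:
        if violates_in:
            merit *= beta1
        if violates_out:
            merit *= beta2
        if violates_convex:
            merit *= beta3
        return merit, True
    # No violation: the benefit calculation of case 3 applies.
    return merit, False


print(merit_cases_1_and_2(2.0, 1, False, False, False))  # (1.0, True)
print(merit_cases_1_and_2(2.0, 3, True, False, True))    # (0.5, True)
print(merit_cases_1_and_2(2.0, 3, False, False, False))  # (2.0, False)
```

Because the penalty is multiplicative and strictly positive, a penalized option keeps a nonzero merit and can still be chosen in a later iteration.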

In case 3, the merit value of the j-th implementation option (meritx,j, j > 0) of operation x is calculated according to (1) how much speedup can be achieved by vSx,j, or (2) the extra area used by vSx,j. The execution time, cycle reduction and silicon area
