Objective - Instruction Set Extension Exploration in Multiple-Issue Processor Architectures

Chapter 1 Introduction

1.6 Objective

Design an ISE exploration algorithm that considers the operations on the critical path when generating ISEs, in order to reduce execution time in multiple-issue architectures.

Chapter 2 Related Work and Background

2.1 Related Work

The ISE design flow comprises application profiling, basic block selection, ISE (candidate) exploration, ISE (candidate) merging, ISE selection together with hardware sharing, and ISE replacement. After application profiling, basic blocks are selected as the input of ISE exploration based on their execution time. ISE exploration explores legal instruction patterns as ISE candidates, which must conform to predefined constraints [4, 5, 6, 7, 8 and 13], e.g. pipestage timing, instruction set architecture (ISA) format, silicon area and register file. In the ISE merging stage, the algorithm merges ISE B into ISE A if ISE B is a subgraph of ISE A.

After ISE merging, ISE selection chooses as many ISEs as possible to attain the highest performance improvement under predefined constraints [9, 10, 11, 12 and 13], such as silicon area and ISA format. To achieve higher hardware utilization, hardware sharing is also performed at this stage (ISE selection). Strictly speaking, the results of both ISE (candidate) exploration and ISE (candidate) merging are ISE candidates, but for simplicity an ISE candidate is sometimes called an ISE. In addition, because this paper focuses only on ISE exploration, the algorithms of the other steps are not addressed; they can be found in [8, 9, 10, 11, 12 and 13].

Pozzi [4] proposed an algorithm that examines all possible ISE candidates and can therefore obtain an optimal solution. It maps the ISE search space, such as a basic block, to a binary tree, and then discards the portions of the tree that violate predefined constraints. Nevertheless, this algorithm is highly compute-intensive and therefore cannot process a large search space. For instance, if a basic block has N operations, and each operation has only one hardware implementation option, then there are 2^N possible ISE patterns (legal or illegal). Notably, one ISE candidate may consist of one or multiple legal ISE patterns. When N = 100 (a typical case), the number of possible ISE patterns is 2¹⁰⁰, which cannot be enumerated in a reasonable time. To decrease the computing complexity, heuristic algorithms derived from the genetic algorithm [4], Kernighan-Lin (KL) [5], a greedy-like algorithm [6] and the ant colony optimization algorithm [8] have been developed. An Integer Linear Programming formulation of ISE exploration was presented in [7]: in this case, the enumeration of subgraphs is implicit in the formulation’s constraints, and the worst-case complexity is still exponential. Nevertheless, all of these algorithms [4, 5, 6, 7 and 8] only consider the legality of operations when exploring ISE

2.2 Background: Ant Colony Optimization (ACO) Algorithm

Why the Ant Colony Optimization Algorithm?

To indicate which part of a DFG becomes an ISE, the implementation of each node must be decided. Even if every node has only a single hardware implementation option, there are 2^N possible ISE patterns (legal or illegal), where N is the DFG size. When N is 100 (a common case), the number of combinations is 2¹⁰⁰. Obviously, this is an NP-hard problem. To obtain a solution efficiently, evolutionary computation, which has been applied effectively to many existing NP-hard problems, is considered.

Many computation models belong to evolutionary computation, such as genetic algorithms and simulated annealing. Among them, Ant Colony Optimization is considered the easiest to map to this problem. The model is selected according to how difficult the mapping to the problem is, because an intuitive and easy mapping usually leads to a simple and effective algorithm design.

One of the concepts of ACO is selecting a path among many choices (two or more) to obtain the shortest path. Selecting among the different implementation options of each node is analogous. This is the main reason ACO is chosen over the other models. The only remaining problem is how the nodes “communicate” with each other; the merit computation in the design takes this into account.

Basic Idea of Ant Colony Optimization Algorithm

The Ant Colony Optimization algorithm [1 and 2] is inspired by the behavior of ants finding paths from the colony to food, and has been used extensively to solve many optimization problems. Initially, ants wander randomly and lay down pheromone on the paths they have passed through. The density of the pheromone determines the probability with which the next ant chooses each path. Since the pheromone evaporates over time, a shorter path is marched over faster and thus has a higher pheromone density. After a period of time, i.e. several iterations, more and more ants choose the shortest path, so the pheromone density on this path keeps growing. Finally, almost every ant chooses the shortest path, and the pheromone on the other paths evaporates to nearly zero.

Figure 2.2.1 is an example. Suppose an ant colony of 50 ants is going to find food, and there are two paths to the food, one twice as long as the other. At t = 1, there is no pheromone on either path, so the ants choose the paths with equal probability. Suppose 25 ants choose one path and 25 ants choose the other. Each ant leaves one unit of pheromone on its path, but 5 units of pheromone evaporate after t = 1, so each path the ants have passed holds 25 – 5 = 20 units of pheromone. At t = 2, the ants start again. After t = 2, the pheromone on each path segment is as shown in the figure. Next time, the right-hand path will be chosen by the ants with higher probability than the left-hand path.
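The feedback loop in this example can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the model behind Figure 2.2.1: it keeps a single pheromone value per path instead of one per path segment, replaces random choices by their expected values, and assumes that the deposit on a path is scaled by 1/length to reflect that a shorter path is traversed faster. The function name `simulate_two_paths` and all parameters are hypothetical.

```python
def simulate_two_paths(num_ants=50, rounds=10, deposit=1.0,
                       evaporation=5.0, lengths=(1.0, 2.0)):
    """Toy two-path ant colony; returns the pheromone pair after each round.

    Assumptions (not from the source): one pheromone value per path,
    expected-value ant counts, deposit scaled by 1/length.
    """
    pheromone = [0.0, 0.0]
    history = []
    for _ in range(rounds):
        total = sum(pheromone)
        # With no pheromone anywhere, both paths are equally likely.
        probs = [0.5, 0.5] if total == 0 else [p / total for p in pheromone]
        for i in range(2):
            ants_on_path = num_ants * probs[i]
            pheromone[i] += deposit * ants_on_path / lengths[i]
            # A fixed amount evaporates each round, never going below zero.
            pheromone[i] = max(0.0, pheromone[i] - evaporation)
        history.append(tuple(pheromone))
    return history


history = simulate_two_paths()
for t, (short_path, long_path) in enumerate(history, start=1):
    print(f"t={t}: short={short_path:.2f} long={long_path:.2f}")
```

In the first round the shorter path ends with 25 – 5 = 20 units, as in the text; in later rounds the pheromone on the longer path decays to zero while the shorter path keeps accumulating.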

Figure 2.2.1: An example of ant behavior (panels: Before Start (t=0), Go (t=1), Evaporation (t=1), Go (t=2), Evaporation (t=2), After (t=2); D = Distance, P = Pheromone; e.g. P = 25→20 after evaporation)

Chapter 3 ISE Exploration

In this paper, the purpose of ISE exploration is to find frequently executed instruction patterns as ISE candidates and to evaluate all implementation options of each operation in the ISE candidates, so as to minimize the execution time with less silicon area. The input of the ISE exploration algorithm is a set of basic blocks (BBs); its output is the ISE candidates together with their implementation options. The implementation options of an operation represent the ways in which it can be implemented, and can be roughly divided into two categories, hardware and software.

The flow of ISE exploration is briefly described as follows. Each input BB is first transformed into a data flow graph (DFG), and an implementation option (IO) table, which lists all implementation options of an operation, is attached to each operation in the DFG. On this extended DFG, the ISE exploration algorithm is executed repeatedly until no further ISE candidate can be found. Note that the ISE exploration algorithm explores only one ISE candidate in each round.

A round usually consists of multiple iterations. Initially, the ISE exploration algorithm chooses one implementation option for each operation according to a probability value (p), which is a function of the pheromone and merit values. The pheromone has the same meaning as in the ACO algorithm, i.e. how many times an implementation option was chosen in previous iterations. The merit value represents the benefit of choosing an implementation option. After a choice is made, the pheromone value is updated. The algorithm then evaluates the implementation options of each operation in the DFG, i.e. calculates their merit values, according to which implementation options were chosen for the neighboring operations in the previous iteration. This process is repeated until the probability values (p) of all operations in the DFG have exceeded a predefined threshold value, P_END.

3.1 ISE Design flow

The ISE design flow, as illustrated in Figure 3.1.1, comprises application profiling, basic block selection, ISE (candidate) exploration, ISE (candidate) merging, ISE selection and hardware sharing, as well as ISE replacement and instruction scheduling. After application profiling, basic blocks are selected as the input of ISE exploration based on their execution time. ISE exploration explores legal instruction patterns as ISE candidates, which must conform to predefined constraints [4, 5, 6, 7, 8 and 13], e.g. pipestage timing, instruction set architecture (ISA) format, silicon area and register file. If only one ISE is explored, the algorithm directly enters the final stage (ISE replacement and instruction scheduling); otherwise, it goes to the next stage (ISE merging). In the ISE merging stage, the algorithm merges ISE B into ISE A if ISE B is a subgraph of ISE A. To avoid unnecessary performance degradation, merging is performed only if the following conditions are satisfied: (1) the execution cycle count of ISE B is equal to or larger than that of the identical subgraph (identical to ISE B) in ISE A, and (2) ISE A and ISE B are not executed simultaneously. After the ISE candidates are generated, ISE selection chooses as many ISEs as possible to attain the highest performance improvement under predefined constraints [9, 10, 11, 12 and 13], such as silicon area and ISA format. To achieve higher hardware utilization, hardware sharing is also performed at this stage. Hardware sharing is the assignment of a hardware resource to more than one operation within different ASFUs. Like ISE merging, hardware sharing follows the rules described above to avoid performance degradation. Finally, ISE replacement discovers all instruction patterns (i.e. subgraphs) in the DFG that match the selected ISEs, prioritizes these matches and replaces them with ISEs. Strictly speaking, the results of both ISE candidate exploration and ISE candidate merging are ISE candidates, but for simplicity an ISE candidate is usually called an ISE. Hence, in this paper, ISE is used in place of ISE candidate. In addition, because this paper focuses only on ISE exploration, the algorithms of the other steps are not addressed; they can be found in [8, 9, 10, 11, 12 and 13].

Figure 3.1.1: ISE design flow

3.2 How to apply ACO algorithm to ISE exploration

ISE exploration in a multiple-issue processor chooses an implementation option for each operation and determines the execution order of the operations. Exploring ISEs in a DFG can be viewed as a search in the space of possible or feasible solutions. Here, a solution means a set of ISE candidates found in a DFG. To apply the ACO algorithm, the search space is organized as a search tree. A path from the root to a leaf of the search tree is considered one possible or feasible solution. After constructing the search tree, we place the ant colony at the root and the food at the leaves, and let the ants make decisions (choose an implementation option, and select one succeeding operation if needed) level by level to construct the solutions.

Selecting the shortest path from ant colony to food can be viewed as similar to choosing the best implementation option (hardware or software) and determining the optimal execution order for all operations.

Figure 3.2.1 illustrates this concept. The left-hand side of Fig. 3.2.1 shows the dependences among O1, O2 and O3, and the search tree is depicted on the right-hand side. In this example, we assume that there are three operations, namely O1, O2 and O3, and that each operation has two hardware (H1 and H2) and two software (S1 and S2) implementation options. Since the possible execution orders of O1, O2 and O3 are O1→O2→O3 and O1→O3→O2, there are two paths after an implementation option is chosen at O1.
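To give a feeling for the size of such a search tree, the root-to-leaf paths of this small example can be enumerated directly. The sketch below is illustrative only: the dependence list `deps` is an assumption inferred from the two execution orders stated above (O2 and O3 both depend on O1), and the helper `legal_orders` is hypothetical.

```python
from itertools import permutations, product


def legal_orders(ops, deps):
    """Yield the execution orders in which every operation follows its predecessors."""
    for order in permutations(ops):
        position = {op: i for i, op in enumerate(order)}
        if all(position[u] < position[v] for u, v in deps):
            yield order


ops = ["O1", "O2", "O3"]
deps = [("O1", "O2"), ("O1", "O3")]      # assumed dependences
options = ["H1", "H2", "S1", "S2"]       # implementation options per operation

orders = list(legal_orders(ops, deps))
# One root-to-leaf path = an execution order plus one option per operation.
paths = [tuple(zip(order, choice))
         for order in orders
         for choice in product(options, repeat=len(ops))]

print(orders)       # the two legal execution orders
print(len(paths))   # 2 orders x 4^3 option choices = 128 paths
```

Even for three operations the tree has 128 leaves, which is why the ants sample paths instead of visiting all of them.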

Figure 3.2.1: Apply ACO to ISE exploration

Chapter 4 ISE Exploration in Multiple-Issue Architecture

The input of the ISE exploration algorithm is the selected basic block(s), and its output is the ISE candidate(s) together with their hardware implementation options. Figure 4.0.1 is an example. Before ISEs are explored, a basic block must be transformed into a data flow graph (DFG). A DFG is represented by a directed acyclic graph G(V,E), where V denotes a set of vertices and E represents a set of directed edges. Every vertex v∈V is an assembly instruction in the basic block, called an “operation” or “node” hereafter. Each edge (u,v)∈E from operation u to operation v signifies that the execution of operation v needs the data generated by operation u.

ISE exploration aims to determine which implementation option each operation should use. As mentioned earlier, packing operations located on a non-critical path into an ISE not only fails to improve performance but also wastes silicon area. To avoid this situation, the algorithm must identify which operations are located on the critical path before encapsulating operations into an ISE.

Exploring ISEs in a multiple-issue architecture means assigning each operation in the DFG a time slot and an implementation option such that the execution time is minimal and, subject to that, as little silicon area as possible is consumed. In Fig. 4.0.1, we assume that the issue width of the processor is two, and that each operation has only one hardware and one software implementation option. After exploration, operations 3 and 5 as well as operations 6, 7 and 8 choose the hardware implementation option, while the other operations select the software one. An ISE is a set of connected/reachable operations that all use a hardware implementation option. In Fig. 4.0.1, there are two ISEs: one consists of operations 3 and 5, and the other includes operations 6, 7 and 8.
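The grouping of hardware-implemented operations into ISEs can be sketched as a connected-component search. This is a minimal illustration, not the algorithm of this paper: it interprets “connected/reachable” as weak connectivity among hardware operations, and the edge list is a hypothetical DFG chosen to be consistent with the operation numbers mentioned above (the actual edges of Fig. 4.0.1 are not reproduced here).

```python
from collections import defaultdict


def extract_ises(edges, hardware_ops):
    """Group hardware-implemented operations into weakly connected sets."""
    neighbours = defaultdict(set)
    for u, v in edges:
        # Only edges between two hardware operations join them into one ISE.
        if u in hardware_ops and v in hardware_ops:
            neighbours[u].add(v)
            neighbours[v].add(u)

    seen, ises = set(), []
    for start in sorted(hardware_ops):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:                       # depth-first search from `start`
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        ises.append(sorted(group))
    return ises


# Hypothetical DFG edges (u, v): v consumes the value produced by u.
edges = [(1, 4), (2, 3), (3, 5), (4, 6), (4, 7), (6, 8), (7, 8)]
hardware_ops = {3, 5, 6, 7, 8}             # operations that chose hardware

print(extract_ises(edges, hardware_ops))   # [[3, 5], [6, 7, 8]]
```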

The proposed algorithm repeats the following two steps until no more ISEs can be explored in a DFG:

Step 1: Identify the critical path using instruction scheduling, and explore ISEs to reduce the length of the critical path.

Step 2: Evaluate the result of this iteration and calculate the benefit of all implementation options of the operations for the next iteration.

To explain this process, an example is depicted in Figure 4.0.2. All assumptions are the same as in Fig. 4.0.1. In step 1, the algorithm identifies the critical paths (1→4→6→8 and 1→4→7→8) by scheduling the instructions, and packs the legal operations (6, 7 and 8) into an ISE. After the new ISE (consisting of 6, 7 and 8) is generated, all implementation options of the operations are evaluated; this process is not shown in Fig. 4.0.2. As in step 1, in step 2 the algorithm schedules all instructions (including ISEs and normal instructions) to determine which path is critical, and then encapsulates the operations (3 and 5) located on the critical path into an ISE. After that, the evaluation process is performed again. In step 3, since no valid operation can be found, the algorithm terminates. A valid operation is one whose packing into an ISE yields a performance gain.
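As a rough illustration of critical-path identification in step 1, the sketch below computes the longest delay-weighted paths of a DFG. It is a simplification and not the method used here: the algorithm in this paper identifies the critical path through instruction scheduling, which also accounts for the issue width, whereas this sketch ignores resource limits. The edge list and the unit delays are hypothetical, and delays are assumed to be positive.

```python
from functools import lru_cache


def critical_paths(edges, delay):
    """Return (length, paths) of the longest delay-weighted paths in a DAG."""
    succs = {node: [] for node in delay}
    for u, v in edges:
        succs[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Delay of `node` plus the longest path through any successor.
        return delay[node] + max((longest_from(s) for s in succs[node]), default=0)

    def expand(node):
        # Enumerate every longest path that starts at `node`.
        if not succs[node]:
            return [[node]]
        tail = longest_from(node) - delay[node]
        return [[node] + rest
                for s in succs[node] if longest_from(s) == tail
                for rest in expand(s)]

    best = max(longest_from(node) for node in delay)
    paths = [p for node in delay if longest_from(node) == best
             for p in expand(node)]
    return best, paths


# Hypothetical DFG and unit software delays for operations 1..8.
edges = [(1, 4), (2, 3), (3, 5), (4, 6), (4, 7), (6, 8), (7, 8)]
delay = {op: 1 for op in range(1, 9)}

best, paths = critical_paths(edges, delay)
print(best, paths)   # 4 [[1, 4, 6, 8], [1, 4, 7, 8]]
```

Under these assumptions the two longest paths coincide with the critical paths named in the example; operations 6, 7 and 8 lie on them and are therefore the first candidates for packing.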

An operation usually has multiple implementation options, which can be divided into two categories, namely hardware and software. If an operation is encapsulated into an ISE, it uses a hardware implementation option; otherwise, it is executed in the processor core. Because of different speed and area requirements, most operations have multiple hardware implementation options.

To represent all implementation options of an operation, a table, called the implementation option (IO) table, is added to every operation. Each entry in the IO table comprises three fields, namely implementation option, delay and area. The implementation option field holds the name of the implementation option. The delay and area fields denote the execution time and the extra silicon area cost of the implementation option, respectively. A new graph G⁺ is generated after the IO table is added to G. Figure 4.1.1 shows an example of G⁺, consisting of

ISE exploration explores ISE candidates in G⁺. An ISE candidate in G⁺ is a subgraph S ⊆ G⁺. The proposed ISE exploration can be formulated as follows.

ISE exploration: Given a graph G⁺, obtain a subgraph S ⊆ G⁺ and evaluate the implementation options of each vertex v∈S to minimize the execution cycle count while reducing the silicon area as much as possible, under the following constraints:

1. IN(S) ≤ N_in,
2. OUT(S) ≤ N_out,
3. S is convex,
4. Load and store operations ∉ S.

IN(S) (OUT(S)) is the number of input (output) values used (generated) by a subgraph S (i.e. an ISE). The user-defined values N_in and N_out denote the read-port and write-port limitations of the register file, respectively. For instruction scheduling to be feasible, an ISE must observe the convex constraint: the ISE’s output cannot connect to its input via operations not grouped in subgraph S (i.e. the ISE). In other words, S is convex if no path exists from an operation u∈S to another operation v∈S that involves an operation w∉S. To conform to the limitation of a load-store architecture, load and store operations are forbidden from being grouped into an ISE.
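The four constraints can be checked mechanically for a given subgraph S. The following sketch is one possible reading under explicit assumptions, not the implementation used in this work: IN(S) counts distinct producers outside S, values that enter the basic block from outside the DFG are not modelled, and an operation without any consumer is assumed to produce a live-out value that counts towards OUT(S). The function names, the edge list and the opcode table are hypothetical.

```python
def successors(edges):
    """Map each node to the list of nodes that consume its value."""
    succs = {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)
    return succs


def count_inputs(S, edges):
    # Distinct producers outside S whose values are used inside S.
    return len({u for u, v in edges if u not in S and v in S})


def count_outputs(S, edges):
    succs = successors(edges)
    # A node of S is an output if it feeds a node outside S, or if it has
    # no consumer at all (assumed to be live-out of the basic block).
    return sum(1 for u in S
               if not succs.get(u) or any(v not in S for v in succs[u]))


def is_convex(S, edges):
    succs = successors(edges)
    stack = [v for u in S for v in succs.get(u, ())]
    reached = set()
    while stack:                      # everything reachable from S
        node = stack.pop()
        if node not in reached:
            reached.add(node)
            stack.extend(succs.get(node, ()))
    outside = reached - set(S)
    # Not convex if some outside node reachable from S feeds back into S.
    return not any(v in S for w in outside for v in succs.get(w, ()))


def is_legal_ise(S, edges, opcode, n_in, n_out):
    return (count_inputs(S, edges) <= n_in
            and count_outputs(S, edges) <= n_out
            and is_convex(S, edges)
            and all(opcode[u] not in ("load", "store") for u in S))


# Hypothetical DFG and opcodes.
edges = [(1, 4), (2, 3), (3, 5), (4, 6), (4, 7), (6, 8), (7, 8)]
opcode = {1: "load", 2: "load", 3: "add", 4: "add",
          5: "add", 6: "mul", 7: "sub", 8: "add"}

print(is_legal_ise({6, 7, 8}, edges, opcode, n_in=2, n_out=1))  # True
print(is_convex({4, 8}, edges))                                 # False
print(is_legal_ise({1, 4}, edges, opcode, n_in=2, n_out=1))     # False
```

In the second call, the path 4→6→8 leaves S through operation 6 and re-enters it at operation 8, so {4, 8} violates the convex constraint; the third call fails because operation 1 is a load.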

4.3 ISE Exploration Algorithm

As mentioned above, the proposed algorithm explores ISEs iteratively until no more ISEs can be found in a DFG. The algorithm is therefore performed for several rounds (a round comprises all steps in Figure 4.3.1); except for the last round, each round produces at least one ISE. The kernel of each round (step 2 to step 9 in Fig. 4.3.1) is executed repeatedly until convergence is achieved. Executing the steps enclosed by the gray rectangle once is called one iteration.

Figure 4.3.1: ISE exploration flow

In each iteration, the proposed algorithm first selects one implementation option from the Ready-Matrix according to a chosen-probability (cp), which depends on the trail and merit values. The Ready-Matrix is a data structure similar to the ready list in list scheduling.

Figure 4.3.2 is an example of a Ready-Matrix; “*” means that the operation does not have this implementation option.

Figure 4.3.2: Example of Ready-Matrix (columns: Operation 1, Operation 2, Operation 3)
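A Ready-Matrix of this kind can be sketched as a mapping from each ready operation to its implementation options, with `None` playing the role of “*”. This is only an illustrative data layout, not the structure used in this work; the IO-table contents (delay, area), the option names and the function `build_ready_matrix` are hypothetical. Rebuilding the matrix from the set of already scheduled operations removes scheduled operations and adds those whose dependencies have all been resolved.

```python
def build_ready_matrix(io_table, preds, scheduled):
    """Return {ready operation: {option name: (delay, area) or None}}."""
    # An operation is ready if it is unscheduled and all predecessors are scheduled.
    ready = [op for op in io_table
             if op not in scheduled
             and all(p in scheduled for p in preds.get(op, ()))]
    # Rows of the matrix: every option name that appears in any IO table.
    option_names = sorted({name for opts in io_table.values() for name in opts})
    return {op: {name: io_table[op].get(name) for name in option_names}
            for op in ready}


# Hypothetical IO tables: option name -> (delay, extra area).
io_table = {
    "op1": {"SW-1": (1.0, 0), "HW-1": (0.4, 120)},
    "op2": {"SW-1": (1.0, 0), "HW-1": (0.5, 200), "HW-2": (0.3, 450)},
    "op3": {"SW-1": (1.0, 0)},
}
preds = {"op3": ["op1", "op2"]}        # op3 depends on op1 and op2

matrix = build_ready_matrix(io_table, preds, scheduled=set())
print(list(matrix))                    # ['op1', 'op2']
print(matrix["op1"]["HW-2"])           # None, i.e. "*" in the figure

matrix = build_ready_matrix(io_table, preds, scheduled={"op1", "op2"})
print(list(matrix))                    # ['op3']
```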

The trail has the same meaning as the pheromone in the ACO algorithm, i.e. the number of times an implementation option was validly chosen in previous iterations. A choice is counted as valid only when choosing this implementation option reduces the execution time.

Here, the trail values of hardware and software implementation option j of operation x are denoted by trail_x,HW-j and trail_x,SW-j, respectively. The merit value is defined as the benefit of selecting an implementation option, and it is obtained using the merit function, which is described in detail later. The merit values of hardware and software implementation option j of operation x are denoted by merit_x,HW-j and merit_x,SW-j, respectively. The chosen probability of an operation x is derived with:

[Eq. 1: chosen probability (cp); the equation is missing from the source text]

The value of SP used in this paper is computed according to the number of child operations; however, it can also be obtained in other ways, e.g. by calculating the mobility of the operation. Merit and SP serve different purposes: merit is mainly used to choose one implementation option for an operation, while SP is responsible for selecting one operation among all ready ones. (An operation is ready if all of its dependencies have been resolved.) Since the merit values of different operations may differ greatly, using them directly to pick an operation to schedule among the ready ones would be unfair. To overcome this problem, the merit values of each operation must be normalized after the merit computation (step 8 in Fig. 4.3.1).

After selecting an implementation option, the algorithm schedules the operation that owns the chosen implementation option. The scheduling process (Operation-Scheduling) is described later. The Ready-Matrix is then updated as follows: (1) remove the operation that owns the chosen implementation option; and (2) add every operation whose dependencies have all been resolved. The algorithm repeats steps 3 to 6 until all operations are scheduled. After all operations are scheduled, the algorithm updates the trail values according to the execution time, and then computes the merit values of all implementation options of each operation in the DFG using the merit function. The iterations of a round are repeated until the end condition is fulfilled, i.e. until convergence. The end condition is that, for every operation in the DFG, the selected-probability (sp) of one of its implementation options exceeds P_END. For sp, the sum in the denominator of Eq. 3 runs only over the implementation options of one operation, whereas for cp (Eq. 1) the sum in the denominator runs over all implementation options in the Ready-Matrix. A larger P_END gives a higher chance of obtaining a better result, but typically takes longer to converge. An implementation option with a selected-probability (sp) larger than P_END is called a taken implementation option. An ISE is a set of connected/reachable nodes (i.e. operations) all of which have a taken hardware implementation option. After convergence, the algorithm executes Make-Convex to make every ISE candidate comply with the convex constraint; if an ISE already conforms to the convex constraint, this step is skipped. Make-Convex repeatedly divides an ISE candidate that does not conform to the convex constraint into smaller ones until all smaller ISE candidates can
