ISE exploration in this work not only identifies frequently executed instruction patterns as ISE candidates, but also evaluates the hardware implementation options of each operation in those candidates, so as to minimize the execution time while using less silicon area. The input of the ISE exploration algorithm is a set of basic blocks; its output is the ISE candidates together with their hardware implementation options. The implementation options of an operation denote the ways in which it can be implemented, and fall roughly into two categories, namely hardware and software.
The flow of ISE exploration is briefly described as follows. According to the results of BB selection, each selected basic block is transformed into a data flow graph (DFG). The DFG is represented by a directed acyclic graph G(V,E), where V denotes a set of vertices and E represents a set of directed edges. Every vertex v ∈ V is an assembly instruction in the basic block, called an “operation” or “node” hereafter. Each edge (u,v) ∈ E from operation u to operation v signifies that the execution of operation v needs the data generated by operation u. Then, an implementation option (IO) table representing all implementation options for an operation is appended to each operation in the DFG. Using the DFG with IO tables, the ISE exploration algorithm is executed repeatedly until no further ISE candidate can be discovered. Significantly, the ISE exploration algorithm identifies at least one ISE candidate in each round. A round usually consists of multiple iterations.
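As a minimal sketch of the DFG construction described above, the following Python fragment derives the edge set E from a basic block. The input encoding (one (destination, sources) tuple per instruction) and the name build_dfg are assumptions made for illustration only; the sketch tracks only true data dependences inside the block.

```python
def build_dfg(block):
    """Return the DFG edge set of a basic block.

    block: list of (dest_register, [source_registers]) tuples, one per
    instruction, in program order (a hypothetical encoding).
    An edge (u, v) means operation v reads a value produced by operation u.
    """
    last_writer = {}   # register -> index of the latest operation writing it
    edges = set()
    for v, (dest, sources) in enumerate(block):
        for reg in sources:
            if reg in last_writer:      # value produced inside the block
                edges.add((last_writer[reg], v))
        if dest is not None:
            last_writer[dest] = v
    return edges


# r3 = r1 + r2;  r4 = r3 << 1;  r5 = r3 & r4
block = [("r3", ["r1", "r2"]), ("r4", ["r3"]), ("r5", ["r3", "r4"])]
edges = build_dfg(block)
```

Because every edge points from an earlier instruction to a later one, the resulting graph is acyclic by construction.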
At each iteration, the ISE exploration algorithm first selects one implementation option for each operation according to a probability value (p), which is a function of the trail and merit values. Note that every implementation option has its own probability value (p). The trail plays the same role as the pheromone in the ACO algorithm, i.e. it reflects how many times an implementation option has been chosen in previous iterations. The merit value is the benefit of choosing an implementation option.
After a choice is made, the trail value of the selected implementation option is increased, while those of the others (i.e. the non-selected implementation options) are decreased. Restated, the result having the highest merit value can be regarded as a locally optimal solution. The trail value guides the current solution toward the global optimum, and the probability value (p) helps the current solution escape local optima. After updating the trail values, the algorithm evaluates all implementation options of each operation in the DFG, i.e. it computes their merit values using the merit function. This process is performed iteratively until the convergence criterion is met, i.e. until, for every operation, the probability value (p) of one implementation option has exceeded a predefined threshold value.
3.1 Implementation option
An operation normally has multiple implementation options, which can be divided into two categories, namely hardware and software. The hardware implementation option means that the operation is included in an ISE and is implemented in additional hardware, i.e. ASFU. Because of different speed and area requirements, most operations usually have multiple hardware implementation options. By contrast, the software implementation option signifies that the operation is performed in the CPU core.
To represent all implementation options for an operation, a table, called the implementation option (IO) table, is appended to every operation. Each entry in the IO table comprises three fields, namely implementation option, delay and area. The implementation option field holds the name of the implementation option. The delay and area fields hold the execution time and the extra silicon area cost of the implementation option, respectively. Obviously, using the software implementation option requires at least one execution cycle, but does not introduce any extra silicon cost. Conversely, using a hardware implementation option can reduce the number of execution cycles, but increases the silicon area consumed. A new graph G+ is generated after the IO table is appended to G. Figure 3 shows an example of G+, consisting of two operations, A and B.
Figure 3: An example of G+
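One way to hold the IO tables of a graph G+ in memory is sketched below. The class name Option, the helper hardware_options and all delay and area numbers are hypothetical; only the three fields and the convention that entry 0 is the software option (one cycle, no extra area) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str      # implementation option field
    delay: float   # execution time (assumed here to be in cycles)
    area: float    # extra silicon area


SOFTWARE = Option("software", delay=1.0, area=0.0)

# Hypothetical G+ with two operations, A and B. Entry 0 of every IO table is
# the software option; the hardware numbers are purely illustrative.
io_table = {
    "A": [SOFTWARE, Option("hw_slow", 0.6, 400.0), Option("hw_fast", 0.3, 900.0)],
    "B": [SOFTWARE, Option("hw", 0.5, 600.0)],
}
edges = {("A", "B")}   # B consumes the value produced by A


def hardware_options(op):
    """Return the hardware entries (index >= 1) of an operation's IO table."""
    return io_table[op][1:]
```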
3.2 Formulation of ISE exploration
ISE exploration explores ISE candidates in G+, and evaluates the implementation options of each operation in the ISE candidates. An ISE candidate in G+ is a subgraph S ⊆ G+. The proposed ISE exploration can be formulated as follows.
ISE exploration: Given a graph G+, obtain a subgraph S ⊆ G+, and evaluate the implementation options of each vertex v ∈ S to minimize the execution cycle count while reducing the silicon area as much as possible, under the following constraints:
1. IN(S) ≤ Nin,
2. OUT(S) ≤ Nout,
3. S is convex,
4. Load and store operations ∉ S.
IN(S) (OUT(S)) is the number of input (output) values used (generated) by a subgraph S (i.e. an ISE). The user-defined values Nin and Nout denote the read and write port limits of the register file, respectively. To conform to the load-store architecture, load and store operations are forbidden from being grouped into an ISE.
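The four constraints can be checked on a DFG edge set as in the following sketch. It simplifies IN(S) and OUT(S) by counting only values that cross the boundary of S along DFG edges; values that are live into or out of the basic block are ignored unless they are modelled as nodes. All function names and the kinds mapping are hypothetical.

```python
def num_inputs(S, edges):
    """IN(S): distinct producers outside S whose value is read inside S."""
    return len({u for u, v in edges if u not in S and v in S})


def num_outputs(S, edges):
    """OUT(S): nodes of S whose value is read by some node outside S."""
    return len({u for u, v in edges if u in S and v not in S})


def is_convex(S, edges):
    """S is convex if no path leaves S and later re-enters it."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    # Start from the outside nodes fed directly by S and walk forward.
    stack = [v for u, v in edges if u in S and v not in S]
    seen = set()
    while stack:
        node = stack.pop()
        if node in S:
            return False            # a path re-entered S
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return True


def is_legal(S, edges, kinds, n_in, n_out):
    """Check constraints 1 to 4 for a candidate subgraph S."""
    return (num_inputs(S, edges) <= n_in
            and num_outputs(S, edges) <= n_out
            and is_convex(S, edges)
            and all(kinds[v] not in ("load", "store") for v in S))


# Diamond-shaped DFG: 1 -> 2 -> 4 and 1 -> 3 -> 4
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}
kinds = {1: "alu", 2: "alu", 3: "alu", 4: "alu"}
```

In this diamond, the set {1, 4} is not convex because the paths through nodes 2 and 3 leave the set and re-enter it, whereas {2, 3, 4} is convex and has a single input.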
3.3 ISE exploration algorithm
The main task of the proposed ISE exploration algorithm can be considered as assigning an implementation option (including hardware and software) for each operation in the basic block to minimize the execution time and silicon area cost.
Therefore, choosing the “right” implementation option for each operation is crucial for the proposed ISE exploration algorithm. As in the ACO algorithm, the implementation option is chosen according to its probability value (p), i.e. the probability of that implementation option being selected at each iteration of the ISE exploration algorithm. Selecting implementation options probabilistically helps the search avoid being trapped in locally optimal solutions. The probability value (px,j) of implementation option j in operation x is a function of the trail and merit values, as given in Equation (1). The significance of the trail value is identical to that of the pheromone in the ACO algorithm: it reflects the number of times that an implementation option has been selected in previous iterations. Here, the trail value of implementation option j of operation x is denoted by trailx,j, and trailx,0 designates the trail value of the software implementation option. The trail value, like the pheromone, must be updated at each iteration. The merit value is defined as the benefit of selecting an implementation option, and is obtained from the merit function, which is described in detail later. The merit value of implementation option j of operation x is denoted by meritx,j, and meritx,0 designates the merit value of the software implementation option. The probability value of implementation option j of operation x (px,j) is derived with:
where k is the number of hardware implementation options in operation x, and α is used to determine the relative influence of trail and merit.
Figure 4 shows the pseudo code of the proposed ISE exploration algorithm. Here, a DFG is assumed to have m (m > 0) operations, each with n (n > 0) implementation options. Initially, i.e. in Step 1, the algorithm sets initial values for the trail and merit values of each implementation option of all operations. Notably, the hardware implementation options have higher initial merit values than the software one, so that the algorithm preferentially chooses hardware implementation options at the start of execution to achieve a higher performance improvement. In Step 2, the algorithm checks every operation to determine whether it has hardware implementation options. If so, the algorithm selects one implementation option (hardware or software) for the operation based on the probability values; if not, it selects the software implementation option. In Step 3, the algorithm updates the trail value of each implementation option in all operations. The trail value of the chosen implementation option (i.e. the one selected in Step 2) is increased by ρ, a positive constant, while those of the others are decreased by ρ. Here, ρ is called the evaporating factor and is very similar to the evaporation rate in ACO. In Step 4, the algorithm derives the merit value of each implementation option in all operations. As in Step 2, the algorithm first checks each operation to determine whether it has a hardware implementation option. If so, the algorithm executes the Hardware-Grouping function, which determines whether the operation can be grouped with its reachable operations as a virtual ISE candidate. If it can, the Hardware-Grouping function uses this virtual ISE candidate to obtain the execution time and silicon area for every hardware implementation option of the operation. The Hardware-Grouping function is described in detail later. The ISE exploration algorithm then computes the merit value with the merit function.
Finally, the ISE exploration algorithm checks the end condition in Step 5. If the end condition is not fulfilled, then the ISE exploration algorithm returns to Step 2 and enters the next iteration; otherwise, it terminates.
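Steps 2 and 3 can be sketched for a single operation as follows. The selection probability used here is an assumption: a normalised weighted sum α·trail + (1 − α)·merit, clamped at zero because trail values can become negative. It is only one plausible reading of "a function of the trail and the merit values" and should not be taken as the form of Equation (1). The function names and the numeric values are likewise hypothetical.

```python
import random


def probabilities(trail, merit, alpha):
    """ASSUMED selection probabilities for the options of one operation."""
    scores = [max(alpha * t + (1.0 - alpha) * m, 0.0)
              for t, m in zip(trail, merit)]
    total = sum(scores)
    if total == 0.0:                      # degenerate case: choose uniformly
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def choose(probs, rng):
    """Step 2: draw one option index according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def update_trail(trail, chosen, rho):
    """Step 3: add rho to the chosen option, subtract rho from the others."""
    return [t + rho if j == chosen else t - rho for j, t in enumerate(trail)]


rng = random.Random(0)
trail = [0.0, 0.0, 0.0]           # index 0 is the software option
merit = [100.0, 200.0, 200.0]     # hypothetical initial merit values
probs = probabilities(trail, merit, alpha=0.5)
chosen = choose(probs, rng)
trail = update_trail(trail, chosen, rho=2.0)
```

Drawing the option at random, rather than always taking the highest score, is what lets a currently weaker option still be explored in later iterations.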
The end condition is that, for every operation in the DFG, the probability value (p) of one of its implementation options exceeds P_END, which is a predefined threshold value very close to 100%. A larger P_END gives a better chance of obtaining a better result, but typically takes longer to converge. An implementation option whose probability value (p) is larger than P_END is called a taken implementation option. An ISE candidate is a set of reachable nodes (i.e. operations), all of which have a taken hardware implementation option.
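The end condition and the extraction of ISE candidates can be sketched as below. The sketch assumes that "reachable" means connected through DFG edges in either direction, so candidates are the connected groups of operations whose taken option is a hardware one (index greater than 0). The function names and the example data are hypothetical.

```python
def converged(probs_per_op, p_end):
    """End condition: every operation has one option with p above P_END."""
    return all(max(p) > p_end for p in probs_per_op.values())


def ise_candidates(taken, edges):
    """Group operations with a taken hardware option into connected sets.

    taken: operation -> index of its taken option (0 means software).
    """
    hw = {x for x, j in taken.items() if j > 0}
    adj = {x: set() for x in hw}
    for u, v in edges:
        if u in hw and v in hw:
            adj[u].add(v)
            adj[v].add(u)
    groups, seen = [], set()
    for start in sorted(hw):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:                      # depth-first walk of one group
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return groups


edges = {(1, 2), (2, 3), (3, 4), (4, 5)}
taken = {1: 0, 2: 1, 3: 1, 4: 0, 5: 2}
groups = ise_candidates(taken, edges)
```

Here operations 2 and 3 form one group, while operation 5 is isolated from them by operation 4, which took the software option.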
1. (Initialization)
   For implementation option j (j=0 to n) of operation x (x=1 to m) in DFG
      trailx,j = 0;
      If (j=0)
         meritx,0 = initial value of software implementation option;
      Else
         meritx,j = initial value of hardware implementation option;
2. (Calculating probability value (p) and choosing implementation option)
   For operation x (x=1 to m)
      If (x has hardware implementation option)
         For implementation option j (j=0 to n) in operation x
            Calculate px,j;
         Choose one implementation option according to its probability value (p);
      Else
         Choose software implementation option;
3. (Trail update)
   For implementation option j (j=0 to n) of operation x (x=1 to m) in DFG
      If the implementation option is selected
         trailx,j = trailx,j + ρ;
      Else
         trailx,j = trailx,j − ρ;
4. (Calculating merit)
   For operation x (x=1 to m)
      If (x has hardware implementation option)
         Hardware_Grouping;
      For implementation option j (j=1 to n) in operation x
         Calculate meritx,j;
5. (Terminating condition)
   If not (end_condition) goto step 2;
Figure 4: ISE Exploration Algorithm
Hardware-Grouping
Hardware-Grouping checks whether operation x can be grouped with its reachable nodes (i.e. operations) as a virtual ISE candidate, and recursively groups operation x with those reachable nodes that chose a hardware implementation option in the previous iteration, forming a virtual ISE candidate, i.e. a virtual subgraph vSx. The result of Hardware-Grouping of operation x using implementation option j is denoted as vSx,j. Significantly, vSx is the set of all vSx,j (i.e. vSx = { vSx,j | j = 1 to n }), and vSx,0 is meaningless because implementation option 0 is the software option. For each vSx,j, Hardware-Grouping measures its execution time, silicon area and register read/write port usage. The execution time of vSx,j is the critical path time in vSx,j, the silicon area of vSx,j is the sum of the silicon areas used by all operations in vSx,j, and the register read/write port usage is the number of register file read/write ports utilized by vSx,j. Notably, summing the silicon area used by each operation in vSx,j may not reflect the silicon area actually utilized by vSx,j; however, it simplifies the calculation, and the sum is an upper bound on the silicon area; that is, more silicon area may be saved in a real implementation.
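The two measurements that Hardware-Grouping takes on a virtual subgraph, critical path time and summed area, can be sketched as follows. The function names are hypothetical, and the per-operation delay and area values in the example are invented for illustration; they are not taken from Figure 5.

```python
def critical_path(group, edges, delay):
    """Longest accumulated delay along any path that stays inside group."""
    preds = {v: [] for v in group}
    for u, v in edges:
        if u in group and v in group:
            preds[v].append(u)
    finish = {}

    def finish_time(v):
        # Finish time of v = its own delay + latest finish among predecessors.
        if v not in finish:
            finish[v] = delay[v] + max(
                (finish_time(u) for u in preds[v]), default=0.0)
        return finish[v]

    return max(finish_time(v) for v in group)


def total_area(group, area):
    """Upper bound on silicon area: sum over all operations in group."""
    return sum(area[v] for v in group)


# Hypothetical virtual subgraph: a -> c and b -> c
edges = {("a", "c"), ("b", "c"), ("c", "d")}
group = {"a", "b", "c"}
delay = {"a": 0.25, "b": 0.5, "c": 0.25, "d": 1.0}
area = {"a": 300.0, "b": 500.0, "c": 400.0, "d": 0.0}

time_of_group = critical_path(group, edges, delay)
area_of_group = total_area(group, area)
```

The critical path runs through b and c, so operation a, although part of the group, does not contribute to the execution time; it does contribute to the area sum.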
Figure 5: Examples of Hardware-Grouping
Figure 5 depicts the working of the Hardware-Grouping function. The table in Fig. 5 lists the delay and area of each implementation option of all operations, and marks the implementation option chosen in the previous selection. In both the top left and bottom left of Fig. 5, nodes enclosed by a dotted line are treated as a virtual ISE candidate. For operation #2, Hardware-Grouping groups operations #2 and #3 as a virtual ISE candidate, i.e. vS2, as shown in the top left of Fig. 5. Because operation #2 has only one hardware implementation option, vS2 has one evaluation result, namely vS2,1 (execution time = 0.8, silicon area = 1200). The bottom left of Fig. 5 is another example, in which Hardware-Grouping groups operation #5 with nodes #2, #3, #6 and #7 as a virtual ISE candidate, i.e. vS5. Since operation #5 has two hardware implementation options, vS5 has two evaluation results, namely vS5,1 (execution time = 1.7, silicon area = 2400) and vS5,2 (execution time = 1.4, silicon area = 3000).
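[Figure 5 panels: “Hardware grouping of operation #2” and “Hardware grouping of operation #5”; table with columns Operation, Option, Delay and Area, where ● marks the chosen option, e.g. operation 1: software, delay 1, area 0.]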
Merit Function
The merit function determines the benefit, i.e. merit value, of the different implementation options of an operation. Briefly, the merit function consists of three cases: size checking (case 1), constraint violation determination (case 2), and performance and area benefit calculation (case 3). Figure 6 shows the merit function algorithm. Initially, in case 1, the algorithm checks whether size(vSx,j), which is the number of operations in vSx,j, is equal to 1. Notably, this work assumes that every operation has a one-cycle delay in the original processor specification. If a multiple-cycle delay is assumed, then case 1 should be tailored to fit this situation. If size(vSx,j) = 1, then vSx,j contains only operation x, so the performance cannot be improved. Therefore, the algorithm multiplies the merit value of every hardware implementation option by a constant β1 (0 < β1 < 1) to lower its chance of being chosen, and the calculation of the merit function terminates. Otherwise, the algorithm proceeds to case 2.
Case 2 verifies whether vSx violates the input/output port and/or convexity constraints. If so, the merit value of each hardware implementation option is multiplied by the constant β2 and/or β3 (0 < β2 < 1 and 0 < β3 < 1), reducing the chance of selecting a hardware implementation option, as in case 1, and the calculation of the merit function terminates. Since operation x may still be grouped into an ISE candidate in later iterations, the algorithm only scales down the merit value of each hardware implementation option by a constant factor. If the algorithm ruled out the possibility of operation x becoming part of an ISE candidate, the optimal solution might also be excluded. If no constraint is violated, the algorithm enters case 3.
In case 3, the merit value of implementation option j (meritx,j, j > 0) in operation x is computed according to (1) the speedup that can be achieved by vSx,j, and (2) the silicon area utilized by vSx,j. The execution cycle reduction and silicon area of the virtual subgraph vSx,j are represented by cycle_savingx,j and areax,j, respectively. The basic concept of case 3 is: (1) if vSx,j can improve the performance, then all hardware implementation options should have a larger merit value than the software one; (2) the merit value should be directly proportional to the execution time reduction; and (3) for the same performance improvement, the hardware implementation option using less silicon area should have a larger merit value. Accordingly, the merit of the software implementation option is always set to a constant, meritx,0, as a baseline. In case 3, the algorithm first sets the merit value of implementation option j to the product of (cycle_savingx,MAX + 1) and meritx,0, where cycle_savingx,MAX is the maximal execution cycle reduction achieved by vSx. The algorithm then checks whether the execution cycle reduction of implementation option j is equal to cycle_savingx,MAX. If so, the algorithm adjusts the merit value according to the silicon area of implementation option j. Here, areax,MAX represents the largest silicon area consumed by vSx. Note that the differences between areax,MAX and areax,j, and between cycle_savingx,MAX and cycle_savingx,j, arise only from operation x. Restated, apart from operation x, all operations in vSx use the same hardware implementation option. If not, the merit of implementation option j is divided by (cycle_savingx,MAX − cycle_savingx,j + 1).
Figure 6: Algorithm of merit function
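The three cases can be sketched in Python as follows. Two details are assumptions made only for this illustration, since the text does not fix them: β2 is applied to input/output port violations and β3 to convexity violations, and the area adjustment in case 3 multiplies the merit by areax,MAX / areax,j. The function name and argument layout are also hypothetical.

```python
def merit_function(prev, merit_sw, size, io_violation, convex_violation,
                   cycle_saving, area, beta1, beta2, beta3):
    """Return new merit values for the hardware options of one operation.

    prev, cycle_saving and area are lists indexed by hardware option.
    """
    # Case 1: operation x could not be grouped with any other operation.
    if size == 1:
        return [m * beta1 for m in prev]
    # Case 2: constraint violation (ASSUMED mapping of beta2 and beta3).
    if io_violation or convex_violation:
        factor = ((beta2 if io_violation else 1.0)
                  * (beta3 if convex_violation else 1.0))
        return [m * factor for m in prev]
    # Case 3: reward cycle saving first, then smaller area.
    cs_max, area_max = max(cycle_saving), max(area)
    merits = []
    for cs, a in zip(cycle_saving, area):
        m = (cs_max + 1) * merit_sw
        if cs == cs_max:
            m *= area_max / a           # ASSUMED form of the area adjustment
        else:
            m /= (cs_max - cs + 1)
        merits.append(m)
    return merits


# Two hardware options: the second saves more cycles but uses more area.
case3 = merit_function(prev=[200.0, 200.0], merit_sw=100.0, size=3,
                       io_violation=False, convex_violation=False,
                       cycle_saving=[1, 2], area=[400.0, 1000.0],
                       beta1=0.9, beta2=0.8, beta3=0.7)
```

With these inputs, both hardware options end up with a merit above the software baseline, and the option with the larger cycle saving receives the larger merit, in line with concepts (1) and (2) of case 3.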