Task Mapping - Task Binding - 應用於單晶片多處理器系統之任務結合方法

Chapter 3 Task Binding

3.3 Task Mapping

The three major placers commonly in use today are min-cut, simulated annealing, and analytic based placers. Usually, the use of analytic based placers is often followed by iterative improvement [19]. Since design of architecture of on-chip-networks is an important issue, we wish to explore much different architecture. Thus, the optimization goals of our placer may change from architecture to architecture. Among the three types of commonly used placers, the simulated annealing based placer can be more easily adapted to new optimization goals than min-cut and analytic based placers [20]. Therefore, we use simulated-annealing technique to map tasks onto processors.

3.3.1 Simulated Annealing

Simulated annealing proposed in [21] is a widely used heuristic to solve several combinatorial optimization problems including many well known CAD ones. It belongs to the class of non-deterministic algorithms. As its name suggests, simulated annealing mimics the annealing process used to gradually cool molten metal to produce high-quality metal objects. During the process, a metal is heated to a very high temperature and then slowly cooled down. At some proper cooling rate, the process has a very good chance of producing high-quality metal objects. If we compare optimization to the annealing process, the attainment of a good solution is analogous to that of refined metal objects.

For a combinatorial optimization problem, we wish to find out solutions with low costs in the solution space, which is a set containing all possible solutions. Some of these solutions correspond to local optima, while others may be global optima. In Figure 11, the solutions are shown along the x-axis. It is assumed that two consecutive solutions are local neighbors, which are solutions that can be reached from the original one with only a slight change. The cost of solution grows in the positive direction along the y-axis. Here, S1, S2, and L are local optima since all the local neighbors have higher and thus inferior costs. Among these three solutions, L, also called global optimum, has the minimum cost.

The process of an iterative improvement scheme starts with an initial solution.

Then the solution is refined again and again. Finally the procedure stops if it finds an optimum solution. For a greedy algorithm, if we start with an initial solution, say I in Figure 11, we gradually slide down the “hill” because the costs there are lower and stop once we reach S1. Although S1 is the best solution we can ever find, it is still a local optimum solution. The initial solution given prevents us from the global

optimum solution. There is no way for a greedy algorithm to find the global minimum solution under this situation, unless it “climbs the hill”.

L S2

Cost I

Solution Space

Figure 11 Local versus global optima

Simulated annealing is such a hill-climbing algorithm. This time the algorithm starts again with the initial solution I and examines its neighborhood. If a neighboring solution has a lower cost, it is always accepted. On the contrary, if a neighboring solution is worse, the algorithm occasionally accepts this inferior solution, since there may be times that the algorithm finds a better solution once it goes beyond the hill. By doing so, the algorithm escapes from getting stuck at a local optimum solution S1.

Pseudo-code for a generic simulated annealing based placer is shown in Figure 12. Here, a cost function is defined to assess the quality of the solution. We start with an initial solution by mapping tasks randomly onto processors. Then, a large number of swaps are made within a region, specified by the range limiter R, to gradually improve the solution and the change in cost is calculated. The swap is always accepted should the cost decrease. Otherwise, it still has a chance to be accepted. The probability of acceptance is given by e⁻^Diff^/^T, where Diff is the change in cost a change makes, and T is a parameter called temperature that controls the likelihood of

accepting moves. Initially, T is very high, which suggests that any move will be easily accepted. Then, it is gradually decreased as the solution is refined. Eventually, the probability of accepting a move that makes the solution worse will be very low.

Figure 12 Pseudo-code for generic simulated annealing placer

3.3.2 Cost Function

In FPGA placement, a placer usually tries to minimize the total wiring (wire-length driven), places blocks so as to balance the wiring density (routability-driven), or to maximize system performance (timing-driven). For a wire-length driven placer, estimation of wire length can be done by using the semi-perimeter method [19]. The method tries to find the smallest bounding box that encloses all the pins to be connected. The estimated wire length is half the perimeter of this rectangle. For example, in Figure 13, the estimated wire length is 9.

Figure 13 A bounding box

Since in our problem a connection path always connects to only two processors, the semi-perimeter method accurately calculates the distance that the transmitted data would travel. Therefore, if we let the cost of a solution be the summation of all distances between each paired processors, and try to minimize the cost, any pair of connecting processors will be placed as close as possible.

3.3.3 Annealing Schedule

The rate at which the temperature is decreased, the exit criterion to terminate the process, the number of moves attempted at each temperature, and the method by which potential swaps are made are defined by the annealing schedule. A good annealing schedule is essential to make sure high quality solution obtained in a reasonable time.

Because we still wish to explore much different architecture, the optimization goals of our placer may change from architecture to architecture. Therefore, a good fixed annealing schedule is still not enough. We need a schedule that automatically adapts to new architecture, no matter what our cost function is. Here we incorporate some best features from [20], [22], [23], and [24].

First, an initial solution is generated and some swaps are made. The initial

temperature is set to twenty times the standard deviation of the costs of these swaps [22]. The temperature is gradually adjusted to stay around a productive temperature where a significant fraction of swaps that makes improvement over the original solution are accepted [20]. Second, the number of moves attempted at each temperature is determined by a function of the number of processors, since the number of processors differs from case to case [24]. Then, the region within swaps are made is adjusted to keep the fraction of swaps accepted around 0.44 for as long as possible [23]. Finally, the procedure terminates when the temperature is less than some small fraction of the average cost of the solutions the algorithm examined [20].

Since detail of the annealing schedule is not our focus, we omit them from this thesis.

在文檔中應用於單晶片多處理器系統之任務結合方法 (頁 29-34)