Our Design Flow - A Task Binding Method for Single-Chip Multiprocessor Systems

Chapter 2 Preliminary

2.3 Our Design Flow

Figure 9 depicts our design flow, which starts with an application. First, an algorithm for the application is chosen and partitioned into interacting tasks. These tasks and their relationships are then captured in a graph model, which records the amount of computation required by each task and the amount of communication between each pair of tasks. Since many algorithms contain feedback loops, an iteration bound, which is the lower bound on the achievable iteration period, is imposed. No iteration period shorter than the iteration bound can be achieved, even when infinite resources are available. At this point we examine whether our algorithm meets the performance constraints. If it does not, we modify and repartition the algorithm, repeating the same procedure until the constraints are met.

Figure 9 Our design flow
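The text does not state how the iteration bound is computed. A minimal sketch, assuming the usual definition from the dataflow-graph literature (the maximum, over all feedback loops, of the loop's total computation time divided by the number of delay elements in the loop), is given below. The loop representation and the numbers are hypothetical and serve only as illustration.

```python
from fractions import Fraction


def iteration_bound(loops):
    """Return the maximum loop bound over all feedback loops.

    `loops` is a list of (computation_times, delay_count) pairs, one pair
    per feedback loop. This representation is an assumption made for the
    example; it is not taken from the thesis.
    """
    if not loops:
        # Without feedback loops, recursion imposes no lower bound.
        return Fraction(0)
    return max(Fraction(sum(times), delays) for times, delays in loops)


# Two hypothetical loops: 6 time units over 1 delay, 9 time units over 2 delays.
example_loops = [([2, 4], 1), ([2, 4, 3], 2)]
bound = iteration_bound(example_loops)
print(bound)
```

If the required iteration period is shorter than this bound, no amount of added hardware helps, and the algorithm itself has to be modified and repartitioned.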

After the algorithm analysis step, processors are allocated for the tasks, and we attempt to schedule the tasks on them. Two important factors are considered in this step: memory size and computing power. For the former, since each processor has only a limited amount of memory, we must make sure that the data required by a task and the intermediate data it generates fit into the memory of its processor. For the latter, we wish to share the task load equally among the allocated processors so that no processor is idle while others are busy. Handling these two points carefully improves system performance, because each processor is utilized as fully as possible.

The task binding process is performed next. In this step, we decide which task is mapped onto which processor. Interacting tasks should always be mapped onto processors in the same region, which reduces the time spent on their interaction. On our platform, processors communicate over dedicated connection paths that are reserved in advance. When assigning connection paths, we try to find the shortest path between each pair of processors while avoiding overuse of routing resources.

Finally, all the generated information is collected and fed to our simulator, on which we can run applications to see whether they work well on our platform.

Chapter 3 Task Binding

In this chapter, the task binding problem will be discussed. First, the problem formulation of task binding will be given in 3.1. Then our solution to the problem is presented in 3.2. Finally, details of the techniques we exploit to solve this problem will be shown in 3.3 and 3.4.

3.1 Problem Formulation

The task binding problem can be formulated as follows.

Given

Applications A_1 to A_k

A corresponding task graph (directed acyclic graph) G_i = (V_i, E_i) for each application A_i

With

Each vertex v ∈ V_i representing a task to run on one processor, with the amount of computation shown in the vertex

Each edge e ∈ E_i indicating data transmission in the direction of the arrow, with the amount of communication shown on the edge

NV_i denoting the total number of vertices in G_i, and NV denoting the sum of all NV_i; NV must be less than or equal to the number of processors available

We wish to

Map each vertex onto a processor

Place connected tasks as close as possible to reduce interaction time

Find a corresponding connection path for each edge

Minimize the total routing resource requirement
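The inputs of this formulation can be held in a small data structure. The sketch below is one possible representation, not the one used in the thesis: the class name `TaskGraph`, its fields, and the example numbers are all hypothetical. It also checks the constraint that NV does not exceed the number of available processors.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """One application's task graph G_i = (V_i, E_i) (illustrative layout)."""
    computation: dict                                    # task -> computation amount
    communication: dict = field(default_factory=dict)    # (src, dst) -> amount


def total_tasks(graphs):
    """NV: the sum of NV_i over all task graphs."""
    return sum(len(g.computation) for g in graphs)


def fits_platform(graphs, num_processors):
    """Each task needs its own processor, so NV must not exceed the count."""
    return total_tasks(graphs) <= num_processors


a1 = TaskGraph({"t0": 5, "t1": 3, "t2": 4}, {("t0", "t1"): 2, ("t1", "t2"): 1})
a2 = TaskGraph({"u0": 6, "u1": 2}, {("u0", "u1"): 3})
print(total_tasks([a1, a2]), fits_platform([a1, a2], 16))
```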

3.2 Task Binding Flow

The task binding process looks for a solution in which every vertex in the task graphs is mapped onto a processor and, for each pair of connected vertices, a connection path is reserved for communication. Because of the complexity of the problem, the process is divided into two parts.

The first part of the process, task mapping, is analogous to the placement problem in FPGA design. In our problem, however, instead of determining which logic block within an FPGA should implement each of the logic blocks required by the circuit, we decide which processor should execute which task.

The second part of the process, connection path assignment, closely resembles routing in FPGA design. In an FPGA, one source node and many sink nodes can be connected to a signal wire. On our platform, by contrast, only one source node and one sink node can be connected to the ends of a virtual channel. This is because a virtual channel carries data, not voltages.

Figure 10 shows the flow of task binding. First, a task graph containing the computation and communication information for all tasks is given. Note that this task graph must be a directed acyclic graph. Next, we use placement techniques from FPGA design to map tasks onto processors. If two tasks communicate, processors in the same region are allocated to them. We then reserve connection paths for these tasks, so that data transmission always uses pre-allocated, dedicated connection paths. Finally, we report the position of the processor assigned to each task, the connection path between each pair of interconnected processors, and the number of virtual channels needed across each physical channel.

Figure 10 Task binding flow

3.3 Task Mapping

The three major types of placers in common use today are min-cut, simulated annealing, and analytic placers. Analytic placers are usually followed by iterative improvement [19]. Since the architecture of on-chip networks is an important design issue, we wish to explore many different architectures, and the optimization goals of our placer may therefore change from architecture to architecture. Of the three types, a simulated annealing based placer can be adapted to new optimization goals more easily than min-cut and analytic placers [20]. We therefore use simulated annealing to map tasks onto processors.

3.3.1 Simulated Annealing

Simulated annealing, proposed in [21], is a widely used heuristic for combinatorial optimization problems, including many well-known CAD problems. It belongs to the class of non-deterministic algorithms. As its name suggests, simulated annealing mimics the annealing process used to gradually cool molten metal to produce high-quality metal objects. During this process, the metal is heated to a very high temperature and then slowly cooled. At a proper cooling rate, the process has a very good chance of producing high-quality metal objects. If we compare optimization to the annealing process, attaining a good solution is analogous to obtaining a refined metal object.

For a combinatorial optimization problem, we wish to find low-cost solutions in the solution space, which is the set of all possible solutions. Some of these solutions are local optima, while others may be global optima. In Figure 11, the solutions are shown along the x-axis. It is assumed that two consecutive solutions are local neighbors, that is, solutions that can be reached from one another with only a slight change. The cost of a solution grows in the positive direction along the y-axis. Here, S1, S2, and L are local optima, since all their local neighbors have higher and thus inferior costs. Among these three solutions, L has the minimum cost and is therefore also the global optimum.

An iterative improvement scheme starts with an initial solution and refines it again and again, stopping when it can find no further improvement. A greedy algorithm that starts from an initial solution, say I in Figure 11, gradually slides down the “hill” toward lower costs and stops once it reaches S1. Although S1 is the best solution the greedy algorithm can find from I, it is only a local optimum. The given initial solution keeps us from reaching the global optimum. A greedy algorithm has no way to find the global minimum in this situation unless it “climbs the hill”.

Figure 11 Local versus global optima (cost plotted over the solution space, with initial solution I, local optima S1 and S2, and global optimum L)
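The greedy behavior described above can be reproduced on a toy one-dimensional landscape. In the sketch below, the cost values are made up and do not correspond to the actual curve in Figure 11. Neighbors are the adjacent indices, and the search moves only to a strictly cheaper neighbor.

```python
def greedy_descent(costs, start):
    """Move to the cheapest adjacent solution until no neighbor is cheaper."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(costs)]
        best = min(neighbors, key=lambda j: costs[j])
        if costs[best] >= costs[i]:
            return i  # no improving neighbor: a local optimum
        i = best


# Hypothetical landscape with local optima at indices 2, 6, and 8.
costs = [9, 7, 4, 6, 8, 5, 1, 3, 2, 6]
stuck_at = greedy_descent(costs, start=0)
global_best = costs.index(min(costs))
print(stuck_at, global_best)
```

Starting from index 0, the descent stops at index 2 (cost 4) and never reaches the global optimum at index 6 (cost 1), because doing so would require passing through the more expensive solutions in between.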

Simulated annealing is such a hill-climbing algorithm. It again starts with the initial solution I and examines its neighborhood. A neighboring solution with a lower cost is always accepted. A neighboring solution with a higher cost is occasionally accepted as well, because the algorithm may find a better solution once it gets over the hill. In this way, the algorithm avoids getting stuck at the local optimum S1.

Pseudo-code for a generic simulated annealing based placer is shown in Figure 12. A cost function is defined to assess the quality of a solution. We start with an initial solution obtained by mapping tasks randomly onto processors. Then a large number of swaps are made within a region specified by the range limiter R to gradually improve the solution, and the change in cost is calculated for each swap. A swap is always accepted if it decreases the cost. Otherwise, it still has a chance of being accepted. The probability of acceptance is given by $e^{-Diff/T}$, where Diff is the change in cost caused by the swap, and T is a parameter called the temperature that controls the likelihood of accepting moves. Initially, T is very high, so almost any move is accepted. T is then gradually decreased as the solution is refined. Eventually, the probability of accepting a move that makes the solution worse becomes very low.

Figure 12 Pseudo-code for generic simulated annealing placer
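The generic scheme described above can be sketched in Python as follows. This is an illustration under several assumptions that are not taken from the thesis: processors sit on a rectangular grid, the cost is the sum of Manhattan distances between communicating tasks, the temperature follows a simple geometric cooling rule, and the range limiter is fixed. All names and constants (`anneal`, `alpha`, `moves_per_t`, `r_limit`, and so on) are hypothetical.

```python
import math
import random


def placement_cost(pos, edges):
    """Sum of Manhattan distances between communicating tasks."""
    return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
               for a, b in edges)


def anneal(tasks, edges, width, height, seed=0, t_start=10.0, t_end=0.01,
           alpha=0.9, moves_per_t=200, r_limit=2):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(slots)
    pos = dict(zip(tasks, slots))               # random initial mapping
    occupant = {p: t for t, p in pos.items()}   # grid slot -> task, if any
    cost = placement_cost(pos, edges)
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            a = rng.choice(tasks)
            ax, ay = pos[a]
            # Pick a target slot inside the range limiter, clamped to the grid.
            bx = min(width - 1, max(0, ax + rng.randint(-r_limit, r_limit)))
            by = min(height - 1, max(0, ay + rng.randint(-r_limit, r_limit)))
            if (bx, by) == (ax, ay):
                continue
            b = occupant.get((bx, by))          # None if the slot is free
            pos[a] = (bx, by)
            if b is not None:
                pos[b] = (ax, ay)
            diff = placement_cost(pos, edges) - cost
            if diff <= 0 or rng.random() < math.exp(-diff / t):
                # Accept the swap and update the slot bookkeeping.
                occupant[(bx, by)] = a
                if b is not None:
                    occupant[(ax, ay)] = b
                else:
                    del occupant[(ax, ay)]
                cost += diff
            else:
                # Reject the swap and restore the previous positions.
                pos[a] = (ax, ay)
                if b is not None:
                    pos[b] = (bx, by)
        t *= alpha                              # geometric cooling (assumed)
    return pos, cost


tasks = ["t0", "t1", "t2", "t3", "t4", "t5"]
edges = [("t0", "t1"), ("t1", "t2"), ("t2", "t3"), ("t3", "t4"), ("t4", "t5")]
final_pos, final_cost = anneal(tasks, edges, width=4, height=4)
print(final_cost)
```

The full cost is recomputed for every move to keep the sketch short; a practical placer would update it incrementally from the edges touched by the swap.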

3.3.2 Cost Function

In FPGA placement, a placer usually tries to minimize the total wiring (wire-length driven), to balance the wiring density (routability-driven), or to maximize system performance (timing-driven). For a wire-length driven placer, the wire length can be estimated by the semi-perimeter method [19]. The method finds the smallest bounding box that encloses all the pins to be connected. The estimated wire length is half the perimeter of this rectangle. For example, in Figure 13, the estimated wire length is 9.

Figure 13 A bounding box

Since a connection path in our problem always connects exactly two processors, the semi-perimeter method accurately calculates the distance that the transmitted data travels. Therefore, if we define the cost of a solution as the sum of the distances between all pairs of connected processors and try to minimize this cost, each pair of connected processors will be placed as close together as possible.
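A short sketch of this cost function is given below. The function names and the coordinates are hypothetical. For two pins, the semi-perimeter of the bounding box reduces to the Manhattan distance between them.

```python
def semi_perimeter(pins):
    """Half the perimeter of the smallest box enclosing all pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def mapping_cost(position, edges):
    """Sum of two-pin semi-perimeters over all communicating task pairs."""
    return sum(semi_perimeter([position[a], position[b]]) for a, b in edges)


# Three made-up pins whose bounding box is 5 wide and 4 high.
print(semi_perimeter([(0, 0), (5, 1), (2, 4)]))

# A made-up mapping of three tasks with two connection paths.
position = {"a": (0, 0), "b": (1, 2), "c": (3, 2)}
print(mapping_cost(position, [("a", "b"), ("b", "c")]))
```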

3.3.3 Annealing Schedule

The annealing schedule defines the rate at which the temperature is decreased, the exit criterion that terminates the process, the number of moves attempted at each temperature, and the method by which potential swaps are generated. A good annealing schedule is essential for obtaining a high-quality solution in a reasonable time.

Because we wish to explore many different architectures, the optimization goals of our placer may change from architecture to architecture, so even a good fixed annealing schedule is not enough. We need a schedule that adapts automatically to a new architecture, whatever the cost function is. We therefore incorporate some of the best features of [20], [22], [23], and [24].

First, an initial solution is generated and a number of swaps are made. The initial temperature is set to twenty times the standard deviation of the costs observed over these swaps [22]. The temperature is then adjusted gradually to stay around a productive temperature, at which a significant fraction of the swaps that improve on the original solution are accepted [20]. Second, the number of moves attempted at each temperature is determined by a function of the number of processors, since the number of processors differs from case to case [24]. Third, the region within which swaps are made is adjusted to keep the fraction of accepted swaps around 0.44 for as long as possible [23]. Finally, the procedure terminates when the temperature falls below some small fraction of the average cost of the solutions the algorithm has examined [20].

Since the details of the annealing schedule are not our focus, we omit them from this thesis.
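Although the details are omitted here, three of the rules above can be illustrated with a short sketch. Everything in it is an assumption: the function names are hypothetical, the population standard deviation is one of several reasonable choices, the range-limiter update `R * (1 - 0.44 + accept_rate)` is one published form of the 0.44 rule that may differ from what this thesis uses, and the stopping fraction is a placeholder.

```python
import statistics


def initial_temperature(sample_costs, factor=20.0):
    """Twenty times the standard deviation of costs seen in trial swaps."""
    return factor * statistics.pstdev(sample_costs)


def update_range_limit(r_limit, accept_rate, r_max, target=0.44):
    """Grow the swap region when too many swaps are accepted, shrink it
    when too few are, and keep it between 1 and r_max (assumed rule)."""
    scaled = r_limit * (1.0 - target + accept_rate)
    return min(r_max, max(1.0, scaled))


def should_stop(temperature, average_cost, fraction=0.005):
    """Stop once the temperature is a small fraction of the average cost.
    The value 0.005 is a placeholder, not a value from the thesis."""
    return temperature < fraction * average_cost


print(initial_temperature([10, 12, 14, 16]))
print(update_range_limit(4.0, accept_rate=0.9, r_max=8.0))
print(should_stop(temperature=0.01, average_cost=40.0))
```

Because these rules depend only on observed costs and acceptance rates, they do not need retuning when the cost function or the architecture changes.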

3.4 Connection Path Assignment

3.4.1 Pathfinder Algorithm

Once processors have been chosen for all tasks, a router assigns connection paths between each pair of interconnected processors. We use a routing algorithm similar to the one proposed in [25]. This router is essentially a variant of the maze router [26]. It runs Dijkstra’s algorithm [27] to find the shortest, that is, the lowest-cost, path between a sender processor and a receiver processor. When competition for routing resources makes the routing illegal, the Pathfinder algorithm [25] performs multiple routing iterations, ripping up some or all nets and rerouting them along different paths. Note, however, that ripping up and rerouting these nets only affects the net ordering. The nets are all routed by the same maze routing algorithm.

The cost of using a routing resource n is defined in [25] as:

$$c(n) = \bigl(b(n) + h(n)\bigr) \cdot p(n)$$

where b(n), h(n), and p(n) are the base cost, the historical congestion cost, and the present congestion cost of n. Pseudo-code for the Pathfinder algorithm is shown in Figure 14. The Pathfinder algorithm adopts the idea from Nair [28] of repeatedly ripping up and rerouting every path until all congestion is resolved. Ripping up and rerouting every net once is called a routing iteration. During the first iteration, every path is routed for minimum cost, even if this leads to overuse of some routing resources. A routing in which some resources are overused is not a legal solution, however. Consequently, when overuse exists at the end of a routing iteration, more iterations must be performed to resolve it.

During each iteration, the present congestion cost, p(n), is updated every time a path is ripped up and rerouted. At the end of each iteration, the historical congestion cost, h(n), of each overused routing resource is updated to record the severity of past congestion on that resource. This makes the router less likely to choose a path through that resource in the next iteration. As a result, all congestion is gradually resolved.

Figure 14 Pseudo-code for Pathfinder algorithm
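Figure 14 is not reproduced here. The sketch below shows negotiated-congestion routing in the spirit of the description above, not the thesis's actual router. It assumes node-based costs of the additive form (b + h) · p that Section 3.4.3 attributes to [25]. The penalty factors `p_fac` and `h_fac`, the way p(n) grows with overuse, and the example graph are all hypothetical.

```python
import heapq


def shortest_path(graph, cost, src, dst):
    """Dijkstra's algorithm over node costs; returns the nodes from src to dst."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph.get(u, ()):
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError(f"no path from {src} to {dst}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def pathfinder(graph, capacity, base, nets, max_iters=50, p_fac=0.5, h_fac=1.0):
    occ = {n: 0 for n in capacity}       # paths currently using each node
    hist = {n: 0.0 for n in capacity}    # historical congestion h(n)
    routes = {}

    def cost(n):
        # p(n) rises above 1 when one more path would overuse the node.
        over = max(0, occ[n] + 1 - capacity[n])
        return (base[n] + hist[n]) * (1.0 + p_fac * over)

    for _ in range(max_iters):
        for net in nets:
            for n in routes.pop(net, ()):    # rip up the old route
                occ[n] -= 1
            path = shortest_path(graph, cost, *net)
            for n in path:                   # reroute and claim the nodes
                occ[n] += 1
            routes[net] = path
        overused = [n for n in occ if occ[n] > capacity[n]]
        if not overused:
            return routes                    # legal routing found
        for n in overused:
            hist[n] += h_fac * (occ[n] - capacity[n])
    raise RuntimeError("congestion not resolved")


# Two nets compete for node "a"; each also has a longer private detour.
graph = {"s1": ["a", "b1"], "b1": ["b2"], "b2": ["t1"],
         "s2": ["a", "c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["t2"],
         "a": ["t1", "t2"]}
nodes = ["s1", "s2", "a", "b1", "b2", "c1", "c2", "c3", "t1", "t2"]
capacity = {n: 1 for n in nodes}
base = {n: 1.0 for n in nodes}
routes = pathfinder(graph, capacity, base, [("s1", "t1"), ("s2", "t2")])
print(routes)
```

In the first iteration both nets take the shortcut through "a", which overuses it. The raised h("a") then pushes the net with the cheaper detour away in the second iteration, and the routing becomes legal.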

3.4.2 Routing Resource Graph

The internal representation should be as architecture-independent as possible, so that different architectures can be described without any change to the algorithm. We therefore use the routing resource graph representation [25] to describe an architecture internally.

In the routing resource graph representation, processors and the buffers on virtual channels become nodes. Virtual channels become directed edges, indicating that data flows over them in one direction only. Each node is assigned a capacity that specifies the maximum number of different paths that can use the node in a legal routing. The number of different paths currently using each node is recorded in its occupancy field. Since all potential connections become edges in the routing resource graph, routing a connection corresponds to finding a path in the graph from a SOURCE node to a SINK node.

Figure 15 illustrates how the part of the routing resource graph between a processor and its local switch is constructed. Data flowing out of the processor starts at a SOURCE node and flows to the OUT node. From the OUT node, it chooses one of the four SW_IN nodes to go to. The label on each SW_IN node indicates that this node will later connect to the SW_OUT node at that side of the switch.

Figure 15 Routing resource graph for a processor and its local switch

Since data flowing into the processor may come from any of the four sides of the local switch, there are four corresponding SW_OUT nodes, each connecting to the IN node. The IN node then connects to the SINK node. Once data arrives at the SINK node, its journey ends.

Here, the capacity of each SW_IN or SW_OUT node is set equal to the channel width factor, which is the maximum number of paths allowed to pass through a node of these two types. If the channel width factor is four, at most four connection paths may pass through any of these nodes. Because data on these nodes may flow into or out of the SOURCE, OUT, IN, and SINK nodes, the capacity of each of these nodes is four times the channel width factor.
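The processor-side portion of the graph described above can be sketched as follows. The class `RRNode`, the helper `build_processor_part`, and node names such as `SW_IN_to_N` and `SW_OUT_from_N` are hypothetical and chosen only to mirror the description. The capacities follow the stated rule: the channel width factor for SW_IN and SW_OUT nodes, and four times that for SOURCE, OUT, IN, and SINK.

```python
from dataclasses import dataclass, field

SIDES = ("N", "E", "S", "W")


@dataclass
class RRNode:
    kind: str                  # SOURCE, OUT, SW_IN, SW_OUT, IN, or SINK
    capacity: int              # maximum number of paths in a legal routing
    occupancy: int = 0         # number of paths currently using the node
    fanout: list = field(default_factory=list)   # names of successor nodes

    def overused(self):
        return self.occupancy > self.capacity


def build_processor_part(width_factor):
    """Build the nodes and edges between one processor and its local switch."""
    g = {}
    for name in ("SOURCE", "OUT", "IN", "SINK"):
        g[name] = RRNode(name, 4 * width_factor)
    for side in SIDES:
        g[f"SW_IN_to_{side}"] = RRNode("SW_IN", width_factor)
        g[f"SW_OUT_from_{side}"] = RRNode("SW_OUT", width_factor)
    g["SOURCE"].fanout.append("OUT")          # outgoing data: SOURCE -> OUT
    g["IN"].fanout.append("SINK")             # incoming data: IN -> SINK
    for side in SIDES:
        g["OUT"].fanout.append(f"SW_IN_to_{side}")
        g[f"SW_OUT_from_{side}"].fanout.append("IN")
    return g


rr_graph = build_processor_part(width_factor=4)
print(len(rr_graph), rr_graph["SOURCE"].capacity, rr_graph["SW_IN_to_N"].capacity)
```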

Figure 16 shows how the part of the routing resource graph inside a switch is constructed. For clarity, we show only the edges to and from the local processor. After data flows out of the local processor, it can go to the adjacent switch at any side. This is modeled by four pairs of SW_IN and SW_OUT nodes. Conversely, data flowing in from the adjacent switch at each side is modeled by another four pairs of SW_IN and SW_OUT nodes.

Figure 16 Routing resource graph for a switch

Finally, Figure 17 shows the part of the routing resource graph that is used when data flows to and from adjacent switches. In this figure, data transmitted to the left switch comes from any of the four SW_OUT nodes labeled W at the W side of the right switch. It then passes through a CHANX node, whose capacity is four times the channel width factor, and goes to any of the four SW_IN nodes at the E side of the left switch. If the one labeled N is chosen, for example, the data goes to the corresponding SW_OUT node at the N side of the left switch.

Figure 17 Routing resource graph across a physical channel

3.4.3 Cost Function

Here we define the cost of a routing resource somewhat differently from [25].

The cost of using a routing resource n is defined as:

$$c(n) = b(n) \cdot h(n) \cdot p(n)$$

where b(n), h(n), and p(n) are the base cost, historical congestion, and present congestion terms mentioned in 3.4.1. Instead of adding b(n) and h(n) together, we multiply them. When terms are added together in a cost function, they must be properly normalized to the same range of magnitude so that both terms work effectively. Multiplying the terms avoids this normalization problem.
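The normalization argument can be checked numerically. The sketch below assumes that the additive form is (b + h) · p and the multiplicative form is b · h · p, as the description above suggests. It further assumes that h(n) is 0 in the additive form and 1 in the multiplicative form when there is no past congestion. The function names and numbers are hypothetical.

```python
def cost_additive(b, h, p):
    """Additive combination of base and historical cost, as attributed to [25]."""
    return (b + h) * p


def cost_multiplicative(b, h, p):
    """Multiplicative combination, as described for this thesis."""
    return b * h * p


def congestion_ratio(cost_fn, b, h_free, h_congested, p=1.0):
    """How much more expensive a historically congested node looks."""
    return cost_fn(b, h_congested, p) / cost_fn(b, h_free, p)


for b in (1.0, 100.0):
    add = congestion_ratio(cost_additive, b, h_free=0.0, h_congested=0.5)
    mul = congestion_ratio(cost_multiplicative, b, h_free=1.0, h_congested=1.5)
    print(b, add, mul)
```

With the additive form, scaling the base cost from 1 to 100 shrinks the relative penalty for a congested node from 1.5 to about 1.005, so h(n) would have to be rescaled along with b(n). With the multiplicative form, the ratio stays at 1.5 whatever the scale of b(n). Note that a base cost of 0 makes the multiplicative cost 0 regardless of h(n) and p(n).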

The base cost of a node, b(n), is set to reflect the latency that a data transmission experiences when passing through the node. The router is thereby encouraged to use as few nodes as possible for each connection path. Table 2 shows the base cost for each type of routing resource. Note that no matter what exact cost is chosen for each type of routing resource, the router always makes sure that no routing resource is overused. This is guaranteed by the congestion avoidance factors, h(n) and p(n), in the cost function.

Resource Type     Base Cost
SOURCE, OUT       0
SW_IN, SW_OUT     1
CHANX, CHANY      2
IN, SINK          0

Table 2 Base cost for each type of routing resource

In fact, SW_IN is a node that does not really exist in our switch. However, since there is only one possible connection from a routing resource of type SW_IN to its corresponding SW_OUT resource, we can set both of their costs to 1, which is half the cost of a CHANX or CHANY resource. If we instead set the costs of SW_IN and SW_OUT nodes to 0 and 2, router performance degrades, because the cost of SW_IN is then always 0 whatever the values of h(n) and p(n) are. This suggests that our router may not be aware of congestion problem on SW_IN type
