Chapter 3 Task Binding
3.4 Connection Path Assignment
3.4.2 Routing Resource Graph
The internal representation used here should be as architecture-independent as possible, so that different architectures can be described without any change to the algorithm. We therefore use the routing resource graph representation [25] to describe the architecture internally.
In the routing resource graph representation, processors and buffers over virtual channels become nodes. Virtual channels become directed edges, indicating that data flows over them in one direction only. Each node is assigned a capacity, which specifies the maximum number of different paths that can use this node in a legal routing, and an occupancy field, which records the number of different paths currently using the node. Since all potential connections become edges in a routing resource graph, routing a connection corresponds to finding a path in the graph from a SOURCE node to a SINK node.
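The node and edge description above can be sketched as a small data structure. This is a minimal illustration, not the thesis implementation: the names `RRNode`, `RRGraph`, `fanout`, `overused` and `is_legal` are hypothetical, and the capacity value of 4 in the example is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class RRNode:
    """One routing resource: a processor port or a buffer on a virtual channel."""
    name: str
    kind: str            # node type, e.g. "SOURCE" or "SINK"
    capacity: int        # max number of paths that may legally use this node
    occupancy: int = 0   # number of paths currently using this node
    fanout: list = field(default_factory=list)  # names of successor nodes

    def overused(self) -> bool:
        return self.occupancy > self.capacity


class RRGraph:
    """Directed graph of routing resources (hypothetical helper class)."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, name, kind, capacity):
        self.nodes[name] = RRNode(name, kind, capacity)
        return self.nodes[name]

    def add_edge(self, src, dst):
        # Edges are directed: data flows from src to dst only.
        self.nodes[src].fanout.append(dst)

    def is_legal(self):
        # A routing is legal when no node carries more paths than its capacity.
        return not any(n.overused() for n in self.nodes.values())


# Smallest possible example: one SOURCE feeding one SINK.
g = RRGraph()
g.add_node("src", "SOURCE", capacity=4)
g.add_node("snk", "SINK", capacity=4)
g.add_edge("src", "snk")
```

Keeping capacity and occupancy on the node rather than on the edge mirrors the text, where legality is checked per node.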
Figure 15 illustrates how the part of the routing resource graph between a processor and its local switch is constructed. Data flowing out of the processor starts at a SOURCE node and flows to the OUT node. From the OUT node, it chooses which of the four SW_IN nodes to go to. The label on each SW_IN node indicates that the node will later connect to the SW_OUT node at that side of the switch.
Figure 15 Routing resource graph for a processor and its local switch
Since data flowing into the processor may come from any of the four sides of the local switch, there are four corresponding SW_OUT nodes, each connecting to the IN node. The IN node then connects to the SINK node. Once data arrives at the SINK node, its journey ends.
Here, the capacity of each SW_IN or SW_OUT type node is set equal to the channel width factor, which is the maximum allowable number of paths passing through a node of either type. If the channel width factor is four, at most four connection paths may pass through any of these nodes. Because data on the four SW_IN or four SW_OUT nodes may flow out of or into the SOURCE, OUT, IN and SINK type nodes, the capacity of each of those nodes is four times the channel width factor.
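The construction around a single processor, together with the capacity rule, can be sketched as follows. This is an illustrative sketch only: the function names, the `SW_IN_N`-style node names and the side order `N, E, S, W` are assumptions, not identifiers from the original tool.

```python
SIDES = ("N", "E", "S", "W")  # the four sides of the local switch


def node_capacity(kind, width):
    """Capacity rule from the text; `width` is the channel width factor."""
    if kind in ("SW_IN", "SW_OUT"):
        return width
    if kind in ("SOURCE", "OUT", "IN", "SINK"):
        return 4 * width
    raise ValueError(f"unknown node type: {kind}")


def build_processor_subgraph(width):
    """Return (capacity, edges) for one processor and its local switch."""
    capacity = {k: node_capacity(k, width) for k in ("SOURCE", "OUT", "IN", "SINK")}
    edges = [("SOURCE", "OUT"), ("IN", "SINK")]
    for side in SIDES:
        sw_in = f"SW_IN_{side}"    # outgoing data; later connects to SW_OUT on that side
        sw_out = f"SW_OUT_{side}"  # incoming data arriving from that side
        capacity[sw_in] = node_capacity("SW_IN", width)
        capacity[sw_out] = node_capacity("SW_OUT", width)
        edges.append(("OUT", sw_in))
        edges.append((sw_out, "IN"))
    return capacity, edges


capacity, edges = build_processor_subgraph(width=4)
```

With a channel width factor of four, each SW_IN or SW_OUT node has capacity 4 and each of SOURCE, OUT, IN and SINK has capacity 16.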
Figure 16 shows how the part of the routing resource graph inside a switch is constructed. For clarity, we only show edges to and from the local processor.
After data flows out of the local processor, it can go to the adjacent switch on any side. This is modeled by four pairs of SW_IN and SW_OUT nodes. Conversely, data arriving from the adjacent switch on each side is modeled by another four pairs of SW_IN and SW_OUT nodes.
Figure 16 Routing resource graph for a switch
Finally, Figure 17 shows the part of the routing resource graph that is used when data flows to and from the adjacent switches. In this figure, data transmitted to the left switch comes from any of the four SW_OUT nodes labeled W at the W side of the right switch. It then passes through a CHANX type node, whose capacity is four times the channel width factor, and goes to any of the four SW_IN nodes at the E side of the left switch. If the one labeled N is chosen, for example, the data will go on to the corresponding SW_OUT node at the N side of the left switch.
Figure 17 Routing resource graph across a physical channel
3.4.3 Cost Function
Here we define the cost of a routing resource somewhat differently than [25].
The cost of using routing resource n is defined as:

$$\mathrm{cost}(n) = b(n) \cdot h(n) \cdot p(n)$$
where b(n), h(n), and p(n) are the base cost, historical congestion, and present congestion terms mentioned in 3.4.1. Instead of adding b(n) and h(n) together, we multiply them. When terms are added together in a cost function, they must be normalized to the same order of magnitude so that both terms are effective. Multiplying them removes the need for such normalization.
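The normalization argument can be illustrated with a small numeric sketch. It assumes that the additive form being contrasted is (b + h) * p; the two nodes and their values are made up for illustration. Rescaling the base cost leaves the ranking of nodes unchanged under the multiplicative form, but can flip it under the additive form.

```python
def additive_cost(b, h, p):
    """Assumed additive form: base and historical terms are summed."""
    return (b + h) * p


def multiplicative_cost(b, h, p):
    """Multiplicative form described in the text."""
    return b * h * p


# Two hypothetical nodes: A has a low base cost but a congestion history,
# B has a higher base cost and no congestion history.
node_a = {"b": 1.0, "h": 3.0, "p": 1.0}
node_b = {"b": 2.0, "h": 1.0, "p": 1.0}


def cheaper(cost_fn, scale):
    """Return which node is cheaper when all base costs are scaled by `scale`."""
    cost_a = cost_fn(node_a["b"] * scale, node_a["h"], node_a["p"])
    cost_b = cost_fn(node_b["b"] * scale, node_b["h"], node_b["p"])
    return "a" if cost_a < cost_b else "b"


# Multiplicative: the ranking does not depend on the scale of b(n).
ranking_mult = (cheaper(multiplicative_cost, 1), cheaper(multiplicative_cost, 100))

# Additive: once b(n) is scaled up, h(n) is drowned out and the ranking flips.
ranking_add = (cheaper(additive_cost, 1), cheaper(additive_cost, 100))
```

Here `ranking_mult` stays the same at both scales, while `ranking_add` changes, which is the effect that normalization would otherwise have to prevent.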
The base cost of a node, b(n), is set to reflect the latency that data transmission will experience when passing this node. A router is encouraged to use as few nodes as possible to route each connection path. Table 2 shows the base cost for each type of routing resource. Note that no matter what exact cost is chosen for each type of routing resource, the router always makes sure that no routing resource is overused. This is guaranteed by the congestion avoidance factors, h(n) and p(n), in the cost function.
Resource Type      Base Cost
SOURCE, OUT        0
SW_IN, SW_OUT      1
CHANX, CHANY       2
IN, SINK           0
Table 2 Base cost for each type of routing resource
In fact, SW_IN is a node that does not really exist in our switch. However, since there is only one possible connection from a routing resource of type SW_IN to its corresponding SW_OUT type resource, we can set both of their costs to 1, which is half the cost of a CHANX or CHANY type resource. If we instead set the costs of SW_IN and SW_OUT type nodes to 0 and 2, router performance degrades, because the cost of an SW_IN node is then always 0 regardless of the values of h(n) and p(n). The router is therefore unaware of congestion on SW_IN type nodes and may spend more time resolving resource congestion.
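The effect of a zero base cost can be checked with a short calculation. The sketch below assumes the multiplicative node cost b(n) * h(n) * p(n) and assumes that the cost of a path is the sum of its node costs; the congestion values 3.0 and 2.0 are arbitrary.

```python
def node_cost(b, h, p):
    """Multiplicative node cost, assuming cost = b * h * p."""
    return b * h * p


def sw_pair_cost(b_in, b_out, h_in, p_in, h_out=1.0, p_out=1.0):
    """Cost of passing one SW_IN node and its corresponding SW_OUT node."""
    return node_cost(b_in, h_in, p_in) + node_cost(b_out, h_out, p_out)


# Base costs (1, 1): congestion on the SW_IN node raises the pair cost.
quiet_split = sw_pair_cost(1, 1, h_in=1.0, p_in=1.0)
busy_split = sw_pair_cost(1, 1, h_in=3.0, p_in=2.0)

# Base costs (0, 2): the SW_IN term is always zero, so congestion is invisible.
quiet_zero = sw_pair_cost(0, 2, h_in=1.0, p_in=1.0)
busy_zero = sw_pair_cost(0, 2, h_in=3.0, p_in=2.0)
```

Both settings give the same total of 2 for an uncongested pair, but only the (1, 1) setting makes a congested SW_IN node more expensive.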
On the other hand, since the maze expansion used to route a connection path always begins with a pair of SOURCE and OUT type nodes, the exact costs set for them do not matter, and we set them to zero to save some computation. Likewise, the expansion terminates when it reaches a pair of IN and SINK type nodes. Setting their costs to zero saves some CPU time, because the maze expansion tends to stop before it expands further.
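A maze expansion over node costs can be sketched as a Dijkstra-style search. This is a generic illustration under stated assumptions rather than the router used in this work: the function name `maze_route`, the dictionary-based graph, and the example node names `SW_A` and `SW_B` are all hypothetical, and the path cost is assumed to be the sum of the costs of the nodes on the path.

```python
import heapq


def maze_route(fanout, cost, source, sink):
    """Return the cheapest node path from source to sink, or None.

    fanout: dict mapping a node to the list of its successor nodes
    cost:   dict mapping a node to its current cost
    """
    best = {source: cost[source]}
    prev = {}
    heap = [(cost[source], source)]
    while heap:
        total, node = heapq.heappop(heap)
        if node == sink:
            # Walk the predecessor links back to the source.
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if total > best.get(node, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        for nxt in fanout.get(node, []):
            new_total = total + cost[nxt]
            if new_total < best.get(nxt, float("inf")):
                best[nxt] = new_total
                prev[nxt] = node
                heapq.heappush(heap, (new_total, nxt))
    return None


# Two alternative switch nodes between OUT and IN; SW_B is congested.
fanout = {
    "SOURCE": ["OUT"],
    "OUT": ["SW_A", "SW_B"],
    "SW_A": ["IN"],
    "SW_B": ["IN"],
    "IN": ["SINK"],
}
cost = {"SOURCE": 0, "OUT": 0, "SW_A": 1, "SW_B": 6, "IN": 0, "SINK": 0}
path = maze_route(fanout, cost, "SOURCE", "SINK")
```

Because IN and SINK cost nothing in this example, the SINK entry reaches the front of the queue as soon as the cheaper switch node has been expanded, so the congested alternative is never expanded.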
The present congestion penalty is updated whenever any net is ripped up and re-routed. It is calculated as:

$$p(n) = 1 + \max\bigl(0,\ [\mathrm{occupancy}(n) + 1 - \mathrm{capacity}(n)] \times p_{fac}\bigr),$$
where occupancy(n) is the number of connection paths currently using routing resource n and capacity(n) is the maximum allowable number of paths that can legally use node n. The historical congestion factor is updated only after an entire routing iteration. Its value during routing iteration i is:
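$$h(n)^{i} = \begin{cases} 1, & i = 1 \\ h(n)^{i-1} + \max\bigl(0,\ [\mathrm{occupancy}(n) - \mathrm{capacity}(n)] \times h_{fac}\bigr), & i > 1 \end{cases}$$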
The value of hfac can be kept constant for all routing iterations, since the fact that h(n) increments after each iteration already provides a sufficient increase in the historical congestion factor. As for pfac, the higher its value, the faster resource congestion is resolved within a routing iteration. However, if pfac is assigned a small value initially and gradually incremented from iteration to iteration, better routing quality is obtained, because the router then tries to resolve congestion while keeping all connection paths short.
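The two congestion terms and a growing pfac can be sketched as below. The update rules are assumed to follow the form commonly used in negotiated-congestion routers; the function names, the geometric growth of pfac and all numeric values are illustrative assumptions, not settings reported in this work.

```python
def present_congestion(occupancy, capacity, pfac):
    """p(n): penalty for adding one more path to a node (assumed form)."""
    return 1.0 + max(0.0, (occupancy + 1 - capacity) * pfac)


def historical_congestion(h_prev, occupancy, capacity, hfac):
    """h(n) after a routing iteration: grows only while the node is overused."""
    return h_prev + max(0.0, (occupancy - capacity) * hfac)


def pfac_schedule(initial, growth, iterations):
    """Hypothetical schedule: start small and multiply by `growth` each iteration."""
    return [initial * growth ** i for i in range(iterations)]


# A node with capacity 4: no penalty while one more path still fits.
p_free = present_congestion(occupancy=3, capacity=4, pfac=0.5)
p_full = present_congestion(occupancy=4, capacity=4, pfac=0.5)

# The historical term accumulates across iterations with a constant hfac.
h1 = 1.0
h2 = historical_congestion(h1, occupancy=6, capacity=4, hfac=0.2)
h3 = historical_congestion(h2, occupancy=3, capacity=4, hfac=0.2)

schedule = pfac_schedule(initial=0.5, growth=2, iterations=3)
```

A small initial pfac lets early iterations favor short paths, and later iterations penalize overuse more heavily, which matches the trade-off described above.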
Chapter 4
Experimental Results
4.1 Experiment Flow
Figure 18 shows our experiment flow. First, we use Task Graph For Free (TGFF) [29], a user-controllable, general-purpose, pseudorandom task graph generator, to generate many random cases. Then, we extract the necessary information from the generated cases and pass it to our task binding tool. After each task is mapped onto a processor and each connection path is assigned, we feed the routing information into our platform simulator and run the simulation. Finally, we use scripts to extract important parameters from the log file and analyze the data.
Figure 18 Our experiment flow
Each generated case contains 1 to 5 independent task graphs, and the minimum number of tasks in each graph ranges from 4 to 20. The maximum number of inputs/outputs of each task ranges from 2 to 10.
Each case is composed of many nodes and arcs. A node represents a task to be run on one processor. An arc drawn from one node to another node indicates that communication exists between these two nodes, flowing from the former to the latter.
The computation amount of a node and the communication amount of an arc are indexed by the numbers shown in the corresponding entries. The mapping between a TGFF output file and its corresponding task graph is shown in Figure 19.
Figure 19 Mapping between output of each case and its corresponding task graph
Our simulator uses two component models: the processor model and the switch model described in 2.2.3. In our processor model, we assume that a processor begins to operate only when all input data is available and the output buffer is large enough for the data that will be generated. For example, the processor allocated for the task shown in Figure 20 requires two inputs, I0 and I1, from the preceding processors. Once it gets all the data, the processor checks whether the buffer is large enough for the output data that will be generated. If so, it begins to operate and puts all output data into the buffer. The corresponding switches then start to send the data to the subsequent processors.
Figure 20 A task with two inputs and three outputs
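The start condition of the processor model can be written as a small predicate. This is a sketch of the rule as stated in the text; the function name `can_fire`, its parameters and the buffer sizes in the example are hypothetical.

```python
def can_fire(required_inputs, available_inputs, output_size, free_buffer):
    """True when every required input has arrived and the outputs fit in the buffer."""
    has_all_inputs = set(required_inputs) <= set(available_inputs)
    return has_all_inputs and output_size <= free_buffer


# A task with two inputs, I0 and I1, as in Figure 20.
required = ["I0", "I1"]

waiting = can_fire(required, {"I0"}, output_size=3, free_buffer=8)
ready = can_fire(required, {"I0", "I1"}, output_size=3, free_buffer=8)
blocked = can_fire(required, {"I0", "I1"}, output_size=3, free_buffer=2)
```

The task stays idle both when an input is missing and when the output buffer is too small, which is why buffer size affects processor utilization.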
4.2 Experimental Results
4.2.1 Routability Analysis
In our platform, the minimum channel width factor required is determined by the maximum number of inputs or outputs of all tasks. (The channel width factor indicates the maximum allowable number of virtual channels passing each input/output queue at each side of a switch.) If a task requires five inputs, a switch with a channel width factor of one cannot support them, because only four connection paths flowing into the corresponding processor can be established.
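The five-input example generalizes to a simple lower bound: with four sides per switch, a task with d inputs or outputs needs a channel width factor of at least d divided by four, rounded up. The sketch below computes only this bound; the function name is hypothetical, and the factor a routed design actually needs may be larger because connection paths also compete for nodes elsewhere in the graph.

```python
def min_channel_width(max_io, sides=4):
    """Lower bound on the channel width factor for a task with max_io inputs or outputs."""
    if max_io < 1:
        raise ValueError("max_io must be at least 1")
    return -(-max_io // sides)  # ceiling division without floating point


bound_for_4 = min_channel_width(4)    # four paths fit on four sides
bound_for_5 = min_channel_width(5)    # the example from the text
bound_for_10 = min_channel_width(10)  # largest degree used in the generated cases
```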
As shown in Figure 21, among the 765 generated cases, only 7 cases require a channel width factor of 4, and 4 cases require a factor of 6. A platform with a channel width factor of 3 is able to support all other cases. This suggests that the switch cost can be very small if the requirements of the applications fall within the range of the generated cases.
Figure 21 Requirement of channel width factor over 765 cases
4.2.2 Performance Analysis
In our experiment, we assume that the computation amount of a task is the number of cycles the allocated processor takes to run this task. Also, if the bandwidth of each physical channel is one unit, we assume that a data transmission takes a minimum of N cycles (under ideal conditions), where N equals the communication amount indicated in the task graph.
After simulation, we collect information from the output file, calculate the performance gain of the system and compute the utilization rate of each processor. The equations for system performance gain and processor utilization rate are listed in the shaded box in Figure 22. If an application composed of many tasks runs on only one processor, the total time E required to run this application once is the sum of the computation times of all tasks. Suppose that the computation load is distributed over N processors. The performance gain is then the system throughput T times E, divided by the number of simulation cycles S, and the utilization rate of each processor is the performance gain divided by N. Note that in Figure 22 the number of times a task has been executed in S cycles is indicated in the circle. For this case, the system throughput equals 7.
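[Figure 22: task graph with execution counts 10, 9, 8 and 7 shown in the circles, and the following shaded box]
S = Simulation Cycle
E = Execution Time on One Processor
T = System Throughput
N = Number of Processors
Performance Gain = T*E/S
Processor Utilization Rate = T*E/S/N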
Figure 22 Performance measurement
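The two measures can be computed directly from the four quantities defined for Figure 22. In the sketch below, only T = 7 comes from the text; the values of E, S and N are made-up numbers chosen to give a round result, and the function names are hypothetical.

```python
def performance_gain(throughput, exec_time_one_proc, sim_cycles):
    """Performance Gain = T * E / S."""
    return throughput * exec_time_one_proc / sim_cycles


def utilization_rate(throughput, exec_time_one_proc, sim_cycles, num_processors):
    """Processor Utilization Rate = T * E / S / N."""
    return performance_gain(throughput, exec_time_one_proc, sim_cycles) / num_processors


T = 7      # system throughput, as in the Figure 22 example
E = 800    # hypothetical: cycles to run the application once on one processor
S = 2000   # hypothetical: number of simulated cycles
N = 4      # hypothetical: number of processors

gain = performance_gain(T, E, S)
rate = utilization_rate(T, E, S, N)
```

With these numbers the system does 2.8 times the work of a single processor, which corresponds to an average utilization of 0.7 across the four processors.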
We experiment on 765 different cases with different communication loads. The computation amount is set to an average of 200, with minimum and maximum values of 100 and 300. The communication amount is set the same way when the ratio of computation to communication equals 1. Note that the ratio is defined as the average computation amount divided by the average communication amount.
The relationship between communication load and average processor utilization rate is shown in Figure 23. Note that if we set the ratio to a value greater than 4, the average processor utilization rate is higher than 0.6. However, if the communication load is heavier, say with a ratio below 3, the average processor utilization rate falls quickly. To overcome this problem, a designer may choose a system with greater physical channel bandwidth. For example, if a system with a ratio equal to 2 is run on a platform with the original physical channel bandwidth doubled, the average processor utilization rate improves from below 0.5 to above 0.6.
Figure 23 Relationship between communication load and processor utilization rate
4.2.3 Scalability Analysis
If an application has a maximum number of inputs or outputs greater than 4, there must be some physical channels shared by at least two virtual channels. Therefore, some input/output data is not transmitted at full speed to the subsequent tasks. Because these tasks have to wait longer, the system performance is lowered.
Here we experiment on 765 tasks with a computation to communication ratio of 3. This ratio is chosen because the impact of the maximum degree of inputs or outputs is apparent when the communication load is heavy. When the communication load is low, the impact is less clear, since there is little traffic on the platform.
In Figure 24, the straight line drawn from left to right predicts the processor utilization rate as a function of the number of processors on the platform. With nearly 70 tasks distributed on our platform under such a heavy communication load, the processor utilization rate still remains above 0.5. This is because only a few physical channels are shared when the maximum number of inputs or outputs equals 4.
Figure 24 Experiment with the maximum number of inputs or outputs equal to 4
On the other hand, the scalability of our platform is not good when the maximum number of inputs or outputs equals 7. In Figure 25, the straight line predicts the processor utilization rate as the number of processors grows. With about 70 tasks distributed on our platform, only a processor utilization rate below 0.4 is achievable; should more tasks be mapped onto the platform, the processor utilization rate may be even lower.
[Plot: Average Processor Utilization Rate (0 to 0.9) versus Number of Processors (0 to 80)]
Figure 25 Experiment with the maximum number of inputs or outputs equal to 7
Chapter 5
Conclusion and Future Work
In this work, the task binding problem is formulated and solved by techniques similar to those of placement and routing in FPGA. By incorporating the processor model, the switch model and the connection path information generated by our task binding tool, systems with different configurations can be simulated in a short time.
Some important parameters are then extracted from the simulation output file and the performance of the system can be assessed before the system is implemented.
In Chapter 4, the performance of systems with different configurations is examined. The results show that the scheduling process discussed in 2.3 must not only take the computing power and memory size of each processor into consideration, but must also pay close attention to the maximum number of inputs/outputs and the communication load of the system. With our simulation environment feeding back these important factors, we wish to find an algorithm that solves the scheduling problem systematically and efficiently.
Also, the buffer size at the input/output of a processor has an impact on the system performance. If the buffer size is unlimited, data transmission always finishes as soon as possible, since there is always space for data inside a processor. However, since on-chip memory is very expensive, this cannot be achieved in practice. We wish to solve the problem of minimizing the total buffer size while maintaining the system performance.
Last, virtual channels may work at the full bandwidth of the physical channel if no other data is transmitted along that physical channel at the same time. We therefore wish to develop an accurate model of traffic contention and further improve our task binding algorithm so that each processor can be utilized to the fullest. In this way, higher system performance can be achieved without adding any resources.
Reference
[1] Axel Jantsch and Hannu Tenhunen, Networks on Chip, Kluwer Academic Publishers, 2003
[2] Adrijean Andriahantenaina, Hervé Charlery, Alain Greiner, Laurent Mortiez and Cesar Albenes Zeferino, “SPIN: a scalable, packet switched, on-chip micro-network,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2003, supplements 70 – 73
[3] Cesar Albenes Zeferino and Altamiro Amadeu Susin, “SoCIN: a parametric and scalable network-on-chip,” in Proceedings of the 16th Symposium on Integrated Circuits and Systems Design, 2003, pages 169 – 174
[4] Partha Pratim Pande, Cristian Grecu, Andre Ivanov and Resve Saleh, “Design of a switch for network on chip applications,” in Proceedings of the 2003 International Symposium on Circuits and Systems, 2003, volume 5, pages 217 – 220
[5] Luca Benini and Giovanni De Micheli, “Networks on chips: a new SoC paradigm,” in Computer, 2002, volume 35, issue 1, pages 70 – 78
[6] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, Martti Forsell, Mikael Millberg, Johny Öberg, Kari Tiensyrjä and Ahmed Hemani, “A network on chip architecture and design methodology,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2002, pages 105 – 112
[7] Daniel Wiklund and Dake Liu, “SoCBUS: switched network on chip for hard real time embedded systems,” in Proceedings of the International Parallel and Distributed Processing Symposium, 2003, pages 78 – 85
[8] Doris Ching, Patrick Schaumont and Ingrid Verbauwhede, “Integrated modeling and generation of a reconfigurable network-on-chip,” in Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004, pages 139 – 145
[9] Marcello Coppola, Stephane Curaba, Miltos D. Grammatikakis, Giuseppe Maruccia and Francesco Papariello, “OCCN: a network-on-chip modeling and simulation framework,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers’ Forum, 2004, volume 3, pages 174 – 179
[10] Robert Siegmund and Dietmar Muller, “Efficient modeling and synthesis of on-chip communication protocols for network-on-chip design,” in Proceedings of the 2003 International Symposium on Circuits and Systems, 2003, volume 5, pages 81 – 84
[11] Jan Madsen, Shankar Mahadevan, Kashif Virk and Mercury Gonzalez, “Network-on-chip modeling for system-level multiprocessor simulation,” in Proceedings of the 24th IEEE International Real-Time Systems Symposium, pages 265 – 274
[12] Kenichiro Anjo, Yutaka Yamada, Michihiro Koibuchi, Akiya Jouraku and Hideharu Amano, “BLACK-BUS: a new data-transfer technique using local address on networks-on-chips,” in Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004, pages 10 – 17
[13] Mikael Millberg, Erland Nilsson, Rikard Thid, Shashi Kumar and Axel Jantsch, “The Nostrum backbone - a communication protocol stack for networks on chip,” in Proceedings of the 17th International Conference on VLSI Design, 2004, pages 693 – 696
[14] Tang Lei and Shashi Kumar, “A two-step genetic algorithm for mapping task graphs to a network on chip architecture,” in Proceedings of the Euromicro Symposium on Digital System Design, 2003, pages 180 – 187
[15] William J. Dally, “Performance analysis of k-ary n-cube interconnection networks,” in IEEE Transactions on Computers, 1990, pages 775 – 785
[16] Jose Duato, Sudhakar Yalamanchili and Lionel M. Ni, Interconnection Networks: An Engineering Approach, Institute of Electrical & Electronics Enginee, 1997
[17] Li-Shiuan Peh and William J. Dally, “A delay model and speculative architecture for pipelined routers,” in The 7th International Symposium on High-Performance Computer Architecture, 2001, pages 255 – 266