Chapter 2 Preliminary
2.3 Our NoC platform
Figure 7 : Processing element database (each task entry, e.g. VLD, lists a processor implementation as (e, m); e : executing time, m : memory usage, c : capacity usage)
2.3.2 Performance evaluation
Because the application is executed repeatedly rather than only once, we take throughput, instead of the overall execution time of the application, as the system performance metric. Take video compression as an example. We may compress an entire movie into a more compact format, e.g. MPEG-4, and a movie may contain thousands of frames. Therefore, when we evaluate the video compression ability of the system, we rate its performance in frames per second rather than seconds per frame.
Throughput is therefore our system performance metric. More precisely, our system performance evaluation calculates how many times the application (task graph) can be executed in a fixed time period.
Chapter 3 Task Scheduling
3.1 Assumption
Before we formulate our problem, it is necessary to define the constraints and make some assumptions.
A task can be implemented in software (a program) or in hardware (logic). Since the local memory of a processor and the total capacity of an FPGA are limited, a processor cannot store an unlimited number of tasks and an FPGA cannot implement an unlimited number of tasks. Therefore, two constraints should be considered. The first, the memory constraint of the processor, means that the size of the programs and intermediate data of the tasks stored in a processor cannot exceed the memory size of that processor. The second, the capacity constraint of the FPGA, means that the total capacity used by the tasks implemented on an FPGA cannot exceed the maximum capacity of that FPGA.
A task needs some buffer in order to execute. For example, as shown in Figure 8, task A is not executed until it has received 4 units of data from task B and 2 units of data from task C, so task A needs a 6-unit buffer to store these input data temporarily. Thus the minimum input buffer requirement of task A is 6 units. Similarly, while task A is executing, the generated output data must be stored in an output buffer, so the minimum output buffer requirement here is 3 units. In total, the minimum buffer requirement is 9 (6 + 3) units.
Figure 8 : Task graph example (edges B→A : 4 units, C→A : 2 units, A→D : 3 units)
However, a buffer that holds only the sum of the input data and output data is not efficient, for two reasons. First, while task A is executing, the input buffer holds 4 units of data from task B and 2 units of data from task C, and the output buffer must reserve 3 units for task A, which means that the buffer is full. Consequently, neither task B nor task C can transmit data to task A until task A finishes executing. This prevents task A from being executed continuously.
Second, if task A is ready to execute but the output buffer is full (task D has not started or has not finished receiving data from task A), task A may idle until the output buffer is cleared. Therefore, the system performance is degraded if the buffer size is only the minimum buffer requirement. If we instead set the buffer size to 18 units (twice the minimum buffer requirement), the buffer works like a ping-pong buffer: the task can receive or transmit data whether or not it is executing, which greatly improves the system performance. Hence, the reasonable buffer requirement is set to twice the sum of the input data and output data.
As mentioned before, the sum of the reasonable buffer requirements of the tasks implemented on the same PE cannot exceed the maximum buffer capacity of that PE.
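The buffer arithmetic above can be illustrated with a minimal sketch. The function names and the way a task is described (a list of input amounts and a list of output amounts) are assumptions made for illustration only; they are not part of the platform.

```python
def min_buffer(inputs, outputs):
    """Minimum buffer: every input unit plus every output unit of one execution."""
    return sum(inputs) + sum(outputs)


def reasonable_buffer(inputs, outputs):
    """Ping-pong sizing: twice the minimum, so data can move while the task runs."""
    return 2 * min_buffer(inputs, outputs)


def satisfies_buffer_constraint(tasks, pe_buffer_capacity):
    """tasks: list of (inputs, outputs) pairs mapped to the same PE (hypothetical format)."""
    total = sum(reasonable_buffer(i, o) for i, o in tasks)
    return total <= pe_buffer_capacity


# Task A of Figure 8: 4 units from B, 2 units from C, 3 units to D.
task_a = ([4, 2], [3])
print(min_buffer(*task_a), reasonable_buffer(*task_a))
```

For task A this gives a minimum of 9 units and a reasonable requirement of 18 units, matching the numbers in the text.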
If two or more tasks are implemented on the same FPGA, it is unnecessary to decide their execution order: the tasks are implemented in different parts of the FPGA, so none of them share the same components.
If two or more tasks are implemented on the same processor, we decide their execution order dynamically. Because communication in the on-chip network behaves dynamically, it is not suitable to fix the execution order of the tasks at design time. In addition, the application is represented as a task graph (dataflow graph), in which a task is never executed until its input data arrive.
For these two reasons, a dynamic First In First Serve (FIFS) strategy is suitable for deciding the execution order of tasks. It is flexible enough to cope with the uncertainty of the network, and it also takes the data availability of tasks into account, which raises processor utilization [15].
A FIFS strategy is implemented as a queue. When all the input data of a task are available and its output buffer has enough space, we push the task into the queue. The processor executes the queued tasks sequentially, in order. The FIFS strategy can be further improved either by considering data dependency or by replacing it with other algorithms.
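The queue behavior described above might be sketched as follows. The class name, the method names, and the two boolean readiness flags are hypothetical; how the platform actually detects data arrival and free output space is not specified here.

```python
from collections import deque


class FifsProcessor:
    """Toy model of one processor that serves ready tasks in arrival order."""

    def __init__(self):
        self.ready = deque()   # tasks waiting to execute, oldest first
        self.executed = []     # execution trace, for inspection only

    def notify(self, task, inputs_available, output_space_ok):
        """Enqueue the task only when both readiness conditions hold."""
        if inputs_available and output_space_ok and task not in self.ready:
            self.ready.append(task)
            return True
        return False

    def run_next(self):
        """Execute the task at the head of the queue, if there is one."""
        if not self.ready:
            return None
        task = self.ready.popleft()
        self.executed.append(task)
        return task
```

A task whose output buffer is still full is simply not enqueued; it would be offered again once the buffer has drained.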
3.2 Problem formulation
The task scheduling problem can be formulated as :
Given :
(1) A task graph G(V, E) and the corresponding processing element database.
(2) An NoC platform which has the following characteristics:
(a) mesh size,
(b) local memory size of processor,
(c) total capacity size of FPGA,
(d) buffer size of processing element,
(e) communication bandwidth of each channel,
with :
(1) memory and capacity constraints, and
(2) buffer constraint.
Determine :
The allocation of each task such that system throughput is maximized.
3.3 Genetic algorithms
Basically, task scheduling allocates a set of tasks to resources such that the performance is optimal. However, this problem is known to be NP-complete. Thus, the task scheduling problem is often handled by heuristic algorithms [8][11][13][14][17].
Nevertheless, there are several important facets that influence the system performance.
First, the NoC platform contains heterogeneous computing resources; for example, a task may be better suited to a processor than to an FPGA. Therefore, the execution time of a task depends on which resource it uses. Second, the communication time between tasks depends heavily on the distance between their resources. It can be greatly reduced by mapping communicating tasks onto the same resource; however, this may violate the constraints mentioned before.
Moreover, the suitability between tasks and resources is not considered. As a result, the task scheduling problem involves a trade-off among execution time, communication time, and constraints.
Typically, genetic algorithms (GAs) perform well at finding near-optimal solutions in a large search space. Also, unlike many traditional optimization techniques, genetic algorithms do not require knowledge of the search space, but need only a measure of the quality of a solution [13][18][19]. Consequently, genetic algorithms are quite suitable for the task scheduling problem.
Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. GAs differ from other traditional optimization methods in four fundamental ways [18] :
(1) GAs work with a coding of the parameter set, not the parameters themselves.
(2) GAs search from a population of points, not a single point.
(3) GAs use payoff (objective function) information, not derivatives or other auxiliary knowledge.
(4) GAs use probabilistic transition rules, not deterministic rules.
The first step in employing GAs is to encode the possible solutions of the optimization problem as a set of chromosomes (the encoding scheme may differ from problem to problem; the simplest way is to encode a solution as a string). Each chromosome represents a solution to the problem, and a set of solutions is referred to as a population.
The next step is to generate an initial population. The chromosomes in the initial population are often generated randomly or heuristically. The initial population is also called the first generation of the evolution. Then, it is necessary to evaluate the fitness of the chromosomes, where the fitness value represents how good (fit) the chromosome is to the problem (environment).
Next, the GAs perform an evolution process to optimize the population generation by generation using genetic operators: selection, mating, and mutation. During the evolution process, the GAs select chromosomes from the current generation according to their fitness values: the higher the fitness of a chromosome, the higher the probability that it is selected. By applying mating and mutation to the selected chromosomes, the next generation is produced. The chromosomes in the next generation are evaluated to obtain their fitness values, and the next generation is then added to the current generation. Some bad chromosomes in the population may be discarded to keep the population at a fixed size.
Finally, the GAs continue the evolution process until the termination condition has been met. When the GAs terminate, the best chromosome is the final result for the problem.
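The generic evolution loop described above can be sketched on a toy problem. Everything problem-specific here is an assumption chosen for brevity: the chromosome is a bit list, the fitness is the number of ones, and termination is a fixed number of generations rather than any particular saturation test.

```python
import random


def evolve(fitness, random_chromosome, mate, mutate,
           pop_size=20, generations=30, seed=0):
    """Generic GA loop: select, mate, mutate, merge, then truncate to pop_size."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Fitness-proportional selection; the small offset avoids all-zero weights.
        weights = [fitness(c) + 1e-9 for c in population]
        children = []
        for _ in range(pop_size // 2):
            dad, mom = rng.choices(population, weights=weights, k=2)
            children.append(mutate(mate(dad, mom, rng), rng))
        # Add the new generation, then discard the worst chromosomes.
        merged = sorted(population + children, key=fitness, reverse=True)
        population = merged[:pop_size]
        history.append(fitness(population[0]))
    return population[0], history


N_BITS = 16  # toy chromosome length (assumption)


def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(N_BITS)]


def one_point(dad, mom, rng):
    cut = rng.randrange(1, N_BITS)
    return dad[:cut] + mom[cut:]


def flip_one(chrom, rng, rate=0.1):
    chrom = list(chrom)
    if rng.random() < rate:
        chrom[rng.randrange(N_BITS)] ^= 1
    return chrom


best, history = evolve(sum, random_bits, one_point, flip_one)
```

Because the merged population is truncated to its best members, the best fitness recorded in `history` never decreases from one generation to the next.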
3.4 GA-based task scheduling flow
The GA-based task scheduling flow is illustrated in Figure 9. First, we generate an initial population. Next, the evolution process tries to explore the search space until it reaches the termination condition. Finally, the best chromosome in the population is our solution.
Figure 9 : Task scheduling flow (Initial Population, then Evolution, then Saturation ?; if no, return to Evolution; if yes, Finish)
3.5 Initial population
For initial population, each chromosome is generated using a meta-random scheme which is divided into two steps:
(1) The tasks in the task graph are sorted in topological order.
(2) The tasks are mapped onto the NoC platform sequentially in this order.
During step 2, we must consider three cases:
(a) If the task has no predecessor, the task is mapped randomly.
(b) If the task has only one predecessor, the task is mapped according to the allocation of its predecessor.
(c) If the task has two or more predecessors, the task is mapped according to not only the allocations of its predecessors but also the communication amount between the task and each predecessor.
Take the task graph in Figure 10 as an example. First, we perform a topological sort on the task graph, which gives the order A, B, C, D. Next, task A is randomly mapped onto the NoC platform. Then, task B and task C are mapped according to the allocation of task A. As shown in Figure 11, task B and task C have a higher probability of being mapped onto allocations that are close to task A. Finally, task D is mapped according to the allocations of task B and task C. Edge B→D and edge C→D have different communication amounts. Therefore, the probability should be higher for allocations near task B than for those near task C. Figure 11 shows how to calculate the probability.
Figure 10 : Task graph example
Figure 11 : Generate an initial solution (columns : sequence, probability, allocation)
While a chromosome is being generated, the constraints must also be considered: a task cannot be assigned to an allocation in which a constraint would be violated. The initial population consists of a fixed number of chromosomes generated by the meta-random scheme. Then, the fitness value of each chromosome is evaluated.
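A minimal sketch of the meta-random scheme follows. The exact probability calculation is not reproduced in the text, so the weighting used here (communication amount divided by one plus the Manhattan distance to each predecessor) is purely an assumption, as are the function names, the `feasible` predicate standing in for the constraints, and the communication amounts in the example.

```python
import random
from graphlib import TopologicalSorter


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def meta_random_mapping(preds, comm, rows, cols, rng,
                        feasible=lambda task, tile, mapping: True):
    """preds: task -> list of predecessors; comm: (pred, task) -> amount."""
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    mapping = {}
    # Step 1: topological order; step 2: map the tasks one by one in that order.
    for task in TopologicalSorter(preds).static_order():
        candidates = [t for t in tiles if feasible(task, t, mapping)]
        if not candidates:
            raise ValueError(f"no feasible allocation for task {task}")
        parents = preds.get(task, [])
        if not parents:
            # Case (a): no predecessor, so the task is placed uniformly at random.
            mapping[task] = rng.choice(candidates)
            continue
        # Cases (b) and (c): tiles near heavily communicating predecessors
        # receive a larger weight (assumed weighting, see lead-in).
        weights = [
            sum(comm[(p, task)] / (1 + manhattan(t, mapping[p])) for p in parents)
            for t in candidates
        ]
        mapping[task] = rng.choices(candidates, weights=weights, k=1)[0]
    return mapping


# Task graph A -> B, A -> C, B -> D, C -> D with hypothetical amounts.
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
comm = {("A", "B"): 2, ("A", "C"): 2, ("B", "D"): 4, ("C", "D"): 1}
mapping = meta_random_mapping(preds, comm, rows=3, cols=3, rng=random.Random(1))
```

Since every feasible tile keeps a non-zero weight, repeated calls with different seeds still yield diverse chromosomes, while allocations with short communication distances are favored.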
There are two reasons why we use a meta-random scheme to generate a chromosome.
First, a purely random scheme may produce chromosomes with very poor performance. Second, the diversity of the chromosomes in the initial population should be kept as high as possible so that the GAs have a higher probability of exploring a larger part of the search space. For these two reasons, we use a meta-random scheme to generate the chromosomes, which addresses both the performance issue and the diversity issue.
3.6 Evolution
GAs try to explore the search space using the three genetic operators: selection, mating, and mutation. The evolution flow is illustrated in Figure 12.
Figure 12 : Evolution flow
3.6.1 Selection
Following the principle of eugenics, an individual (chromosome) with a higher fitness value has a higher probability of producing the next generation. Therefore, we select pairs of parents from the population using the roulette wheel method [18]. Each chromosome in the population has a roulette wheel slot sized in proportion to its fitness value, and a chromosome is selected by spinning the roulette wheel. Take Figure 13 as an example: chromosome A has the largest fitness value, so it occupies the largest slot on the roulette wheel. The chromosomes selected by spinning the roulette wheel many times are mated in the next step.
Figure 13 : Roulette wheel method (chromosomes A, B, C, D, E with fitness 5, 4, 2, 3, 1; each spin selects one chromosome)
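Roulette wheel selection can be sketched with cumulative fitness sums and a binary search. The function name is hypothetical and the fitness values are illustrative; `rng` is any object offering a `random()` method that returns a value in [0, 1).

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_select(chromosomes, fitness, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    cumulative = list(accumulate(fitness))      # right edge of each slot
    spin = rng.random() * cumulative[-1]        # where the wheel stops
    index = bisect_right(cumulative, spin)      # slot that contains the stop
    return chromosomes[min(index, len(chromosomes) - 1)]


names = ["A", "B", "C", "D", "E"]
fitness = [5, 4, 2, 3, 1]
rng = random.Random(0)
parents = [roulette_select(names, fitness, rng) for _ in range(10)]
```

With these values chromosome A owns 5/15 of the wheel, so over many spins it is selected about one third of the time.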
3.6.2 Mating
GAs use mating to explore the search space and try to find the local optimum. In nature, children inherit features from their parents. For example, if the parents have big eyes, their children usually have big eyes, too. Likewise, in GAs, the generated chromosomes inherit features from their parents.
First of all, we explain how the traditional mating schemes work. Before doing so, it is necessary to introduce the traditional representation of chromosomes. Each chromosome is represented as a string, and each word in the string represents the allocation of the corresponding task. As shown in Figure 14, the chromosome is represented as the string {(0,0), (1,1), (2,1), (3,1), (1,1)}, which indicates that task A is in (0,0), task B is in (1,1), and so on.
Figure 14 : Traditional chromosome representation (tasks A to E placed on a mesh indexed by (row, col); chromosome string {(0,0), (1,1), (2,1), (3,1), (1,1)})
The traditional mating schemes include single-point crossover, two-point crossover, etc. As illustrated in Figure 15, single-point crossover first randomly selects a cross point in the two parents, and then exchanges the sub-strings between the cross point and the end of the string. As the name implies, two-point crossover uses two randomly selected cross points to choose the sub-string to be exchanged. However, neither of these two mating schemes considers the dependency of tasks.
Figure 15 : Single-point crossover and two-point crossover (each panel shows the parents, the cross point or points, and the resulting children)
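The two traditional schemes can be sketched on the string representation, with one allocation per task in a fixed task order. The function names and the second parent's allocations are made up for illustration.

```python
import random


def single_point_crossover(dad, mom, rng):
    """Swap everything from one random cross point to the end of the string."""
    cut = rng.randrange(1, len(dad))
    return dad[:cut] + mom[cut:], mom[:cut] + dad[cut:]


def two_point_crossover(dad, mom, rng):
    """Swap the sub-string between two distinct random cross points."""
    i, j = sorted(rng.sample(range(len(dad) + 1), 2))
    return dad[:i] + mom[i:j] + dad[j:], mom[:i] + dad[i:j] + mom[j:]


dad = [(0, 0), (1, 1), (2, 1), (3, 1), (1, 1)]   # tasks A..E
mom = [(3, 3), (2, 2), (0, 1), (0, 2), (1, 0)]   # hypothetical second parent
rng = random.Random(0)
child1, child2 = single_point_crossover(dad, mom, rng)
child3, child4 = two_point_crossover(dad, mom, rng)
```

Both operators cut the string by position only, which is why tasks that communicate heavily can end up separated after the exchange.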
Unlike the traditional mating schemes, the two mating schemes we propose consider the dependency of tasks in order to obtain better communication performance.
In contrast to the traditional representation, our chromosome is represented as a graph, where the word in each vertex indicates the allocation of the corresponding task. As illustrated in Figure 16, the top vertex of the chromosome indicates that task A is in (0,0), the top-right vertex indicates that task B is in (1,1), and so on. Hence, our representation can express the dependency between tasks, which is important information that the string representation cannot provide.
Figure 16 : Our chromosome representation (panels : Task graph, Chromosome)
The first mating scheme we propose is sub-graph crossover, which exchanges a sub-graph in a well-coded representation. Figure 17 illustrates the exchanging process of sub-graph crossover. First, we randomly select a number x between 1 and n-1 (where n is the total number of tasks). Second, we randomly choose a task in the task graph and then perform a breadth first search (BFS) starting from that task until the number of visited tasks reaches x. The visited tasks form the sub-graph, which we exchange between the parents to produce the next generation.
Figure 17 : Sub-graph crossover
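Sub-graph crossover might be sketched as follows, with a chromosome stored as a dictionary from task to allocation. The text does not say whether the BFS follows edge direction, so treating the task graph as undirected is an assumption, as are the function names and the example allocations.

```python
import random
from collections import deque


def bfs_subgraph(neighbors, start, x):
    """Collect up to x tasks in breadth first order, starting from `start`."""
    visited, seen, queue = [start], {start}, deque([start])
    while queue and len(visited) < x:
        for nxt in neighbors[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                visited.append(nxt)
                queue.append(nxt)
                if len(visited) == x:
                    break
    return visited


def subgraph_crossover(dad, mom, neighbors, rng):
    """Exchange the allocations of one randomly grown sub-graph."""
    tasks = sorted(dad)
    x = rng.randint(1, len(tasks) - 1)               # sub-graph size in 1..n-1
    sub = bfs_subgraph(neighbors, rng.choice(tasks), x)
    son, daughter = dict(dad), dict(mom)
    for task in sub:
        son[task], daughter[task] = mom[task], dad[task]
    return son, daughter, sub


neighbors = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
dad = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
mom = {"A": (2, 2), "B": (2, 1), "C": (1, 2), "D": (0, 2)}
son, daughter, sub = subgraph_crossover(dad, mom, neighbors, random.Random(3))
```

Because the exchanged tasks are connected in the task graph, their relative placement in the donor parent is carried over as a unit.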
Although sub-graph crossover considers the dependency of tasks, it is still not good enough. It can be further improved by taking the suitability between the parents and the exchanged sub-graph into account. The higher the fitness of a chromosome, the higher the probability that it is selected as a parent. In other words, the selected parents usually provide good performance. It is therefore unwise to change the parents drastically, because this may destroy their original structure and produce a bad chromosome.
Hence, shape crossover is proposed to raise the suitability between the parents and the exchanged sub-graph. As shown in Figure 18, the allocations of the tasks in the sub-graph form a shape. If we exchange SA and SB directly at their absolute positions, the original structure is destroyed: the communication between the original A and SA may be good, but after the exchange the original A and SB may be too far apart to communicate efficiently. Therefore, the better way is to exchange SB and SA at the relative position, so that the structure is changed only slightly rather than destroyed.
The details of shape crossover are described in the following steps.
Assume dad A and mom B produce son C.
(1) Randomly choose a sub-graph, and then find the allocations of the corresponding tasks, which form a shape. In Figure 18, for example, the allocations of the corresponding tasks are SA and SB, respectively.
(2) Rotate and reflect SB into 8 orientations (rotations of 0°, 90°, 180°, and 270°, plus the reflection of each), as illustrated in Figure 19. Then shift each shape to the position that brings its gravity center as close as possible to that of SA. Figure 20 shows the process of shifting SB8 to SA.
Figure 19 : The 8 orientations of SB (SB1 to SB4 : rotate 0°, 90°, 180°, 270°; SB5 to SB8 : the reflections of SB1 to SB4)
Figure 20 : Shift SB8 close to SA (the gravity center of SB8 is shifted toward the gravity center of SA)
(3) After step (2), we have 8 solutions, and we must estimate which of them is the best. We do so by calculating the communication overhead that each solution causes. The communication overhead is defined as Σ ci*di, where ci is the amount of an input or output communication of the sub-graph and di is the Manhattan distance of that communication. For example, as shown in Figure 21, the communication overhead of SB8 = 5*1+4*4+2*2+3*1+2*3+1*4 = 38. The first term, 5*1, belongs to the top-left vertex of the sub-graph, where 5 is the communication amount and 1 is the Manhattan distance between (1,2) and (1,1); the other terms are obtained in the same way.
Figure 21 : Communication overhead of SB8
After the communication overhead of these 8 solutions has been calculated, the final result is the solution in which SB causes the minimum communication overhead.
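Steps (2) and (3) can be sketched together. A shape is stored as a dictionary from task to (row, col); the function names, the rounding of the shift to whole tiles, and the edge list format are assumptions, and mesh bounds and constraints are deliberately ignored at this stage. The eight orientations are generated in a different order from SB1 to SB8.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def orientations(shape):
    """Return the 8 rotations and reflections of a shape {task: (row, col)}."""
    result, current = [], dict(shape)
    for _ in range(4):
        result.append(current)
        result.append({t: (r, -c) for t, (r, c) in current.items()})  # reflect
        current = {t: (c, -r) for t, (r, c) in current.items()}       # rotate 90
    return result


def centroid(shape):
    n = len(shape)
    return (sum(r for r, _ in shape.values()) / n,
            sum(c for _, c in shape.values()) / n)


def shift_to(shape, target):
    """Translate by whole tiles so the gravity center lands near `target`."""
    cr, cc = centroid(shape)
    dr, dc = round(target[0] - cr), round(target[1] - cc)
    return {t: (r + dr, c + dc) for t, (r, c) in shape.items()}


def overhead(shape, outside, external_edges):
    """Sum of amount * Manhattan distance over edges leaving the sub-graph.

    external_edges: list of (inner_task, outer_task, amount) triples.
    """
    return sum(amount * manhattan(shape[i], outside[o])
               for i, o, amount in external_edges)


def best_orientation(sb, sa, outside, external_edges):
    """Try all 8 shifted orientations of SB and keep the cheapest one."""
    target = centroid(sa)
    candidates = [shift_to(s, target) for s in orientations(sb)]
    return min(candidates, key=lambda s: overhead(s, outside, external_edges))
```

Tasks that land outside the mesh or violate a constraint after the shift would still have to be repaired afterwards.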
During step (2), some of the tasks may violate the constraints and lead to an infeasible solution. Therefore, we must repair the tasks that violate the constraints. The repair method is similar to the second step in generating a chromosome of the initial population. The difference is that we remap only the tasks that violate the constraints, not all the tasks in the task graph.
3.6.3 Mutation
The goal of mutation is to prevent GAs from finding only a local optimum. By randomly changing a feature of the chromosome, the chromosome may have the opportunity to reach or get close to the global optimum. Our mutation scheme first selects a task at random. Next, the selected task moves, with a certain probability, to a random allocation. The new allocation must also satisfy the constraints. An example is shown in Figure 22.
Figure 22 : Mutation example (1. Randomly select a task : task E; 2. Randomly move it to a new allocation)
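The mutation step can be sketched in a few lines. The mutation rate, the function name, and the `feasible` predicate standing in for the memory, capacity, and buffer constraints are all assumptions for illustration.

```python
import random


def mutate(chromosome, tiles, rng, rate=0.1,
           feasible=lambda task, tile, chrom: True):
    """Move one randomly chosen task to a random feasible tile with probability `rate`."""
    mutated = dict(chromosome)                 # leave the parent untouched
    task = rng.choice(sorted(mutated))         # 1. randomly select a task
    if rng.random() < rate:                    # 2. move it with some probability
        options = [t for t in tiles
                   if t != mutated[task] and feasible(task, t, mutated)]
        if options:
            mutated[task] = rng.choice(options)
    return mutated


tiles = [(r, c) for r in range(3) for c in range(3)]
parent = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1), "E": (2, 2)}
child = mutate(parent, tiles, random.Random(5), rate=1.0)
```

Restricting the candidate tiles to feasible ones means a mutated chromosome never needs a separate repair pass.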
3.6.4 Simulation
After mating and mutation, it is necessary to evaluate the fitness value of each newly generated chromosome. We use high-level simulation to obtain the throughput of every newly generated chromosome. Figure 23 demonstrates our simulation flow.
Figure 23 : Simulation flow
First, the buffer length of each task is assigned. We assign input and output buffers to every task equally and make sure that each task has one input buffer and one output buffer, as shown in Figure 24.
Figure 24 : Buffer length assignment
Second, executing the application (task graph) consecutively on our platform involves many dynamic behaviors, and the time needed to obtain the throughput of a chromosome must be short. Therefore, it is feasible neither to use a simple scheduling scheme nor to simulate at the cycle-accurate level to obtain the throughput of each chromosome.