Chapter 1 Introduction
1.4 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 introduces related work, our design flow and some basic concepts. Chapter 3 presents our task scheduling method: we use a genetic algorithm with the shape crossover method to reduce communication overhead, and we use a partition method to improve the crossover procedure, which can lead to better throughput between generations. The experimental flow and results are shown and discussed in Chapter 4. Finally, conclusions and future work are given in the last chapter.
Chapter 2
Preliminary
In this chapter, we first introduce related work in design methodology and scheduling. We then describe our design flow and our NoC platform, including its architecture and characteristics. Finally, we introduce the concept of genetic algorithms and explain why they are suitable for the task scheduling problem.
2.1 Related Works
2.1.1 Design Methodology
There is much research in the NoC domain. [6] proposes a layered-micronetwork design methodology to address future SoC designs. As shown in Figure 2.1, every layer in this vertical design flow is specialized and optimized for the target application domain.
Figure 2.1: Layered-micronetwork design methodology
In [10], a circuit-switched two-dimensional mesh network called SoCBUS is proposed. It introduces the concept of the packet connected circuit (PCC): a packet is switched through the network, locking the circuit as it goes. PCC is similar to circuit switching, and it has the advantages of guaranteed bandwidth and deadlock freedom. An integrated modeling, simulation and implementation environment is proposed in [11]. The NoC infrastructure and processors are modeled, and simulation is performed to find the optimal network configuration.
[12] presents Xpipes, which contains a library of soft macros (switches, network interfaces and links) from which domain-specific heterogeneous architectures can be instantiated and synthesized. Xpipes provides a tool called XpipeCompiler, which automatically instantiates a customized NoC from the library. Specifically, the designer uses the Xpipes library to describe the network architecture, and this description is given to XpipeCompiler in an input file. The tool generates a hierarchical SystemC description of the whole system, which can be compiled and simulated at the cycle-accurate and signal-accurate level. [13] presents an algorithm called NMAP, which maps cores onto an NoC architecture under bandwidth constraints. It can be used for both single-path routing and split-traffic routing. In [14], the author uses a simple packet-switching communication model to reduce communication time and proposes a two-step genetic algorithm that maps a parameterized task graph onto a 2D-mesh NoC architecture to minimize the overall execution time of the task graph.
2.1.2 Scheduling
[15] gives the basic algorithm and concepts for applying a genetic algorithm to the multiprocessor scheduling problem. [16] describes a partition technique for genetic algorithms: it divides the task graph into several partitions according to the execution time relation, then analyzes the benefit of the partitioned genetic algorithm and presents experimental results to support it. [17] provides a graphic-based crossover method and chromosome representation, which take the communication overhead and data dependency into account during crossover.
Since NoC is a communication-driven architecture, we consider the case in which communication is the bottleneck of the system. In that case, a crossover with lower communication overhead can greatly improve system performance.
[17] proposes new crossover schemes that take the dependency of tasks into consideration to obtain better performance. As shown in Figure 2.2, it uses a graphical chromosome that contains both the task graph and the allocation of tasks. For example, the top node of the chromosome indicates that task A is mapped to (0,0), and the bottom node indicates that task E is mapped to (1,1). This representation therefore carries the data dependency information, which is its main innovation.
Figure 2.2: Graphic-based chromosome representation
The first crossover scheme proposed in [17] is the sub-graph crossover operator, which exchanges a sub-graph in a well-coded graph-based representation. The exchange process is illustrated in Figure 2.3. First, it randomly chooses a task on the task graph. For example, in Figure 2.3(a), task F, located at (2,2), is chosen for Parent X, and task F, located at (3,0), is chosen for Parent Y. Next, it randomly selects a number x ranging from 1 to n-1, where n is the total number of tasks in the graph. In this case, x is 5 for Parent X, as shown in Figure 2.3(a). Then a breadth-first search (BFS) is performed from the chosen task until the number of visited tasks reaches x. This yields the sub-graphs in Figure 2.3(a), and the communication amount is labeled on each cut edge. For instance, the communication amount between task A and task C is 4. Finally, the sub-graphs are exchanged to generate the offspring X’ and Y’; the result is shown in Figure 2.3(b).
Figure 2.3: Sub-graph crossover operation. (a) Parent X and Parent Y; (b) Offspring X’ and Offspring Y’
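The selection step of the sub-graph crossover can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation from [17]: the adjacency-list format, the function name `select_subgraph`, the six-task example graph, and the choice to treat edges as undirected during the search are all hypothetical.

```python
import random
from collections import deque

def select_subgraph(adjacency, start, x):
    """Collect x tasks by breadth-first search starting from `start`.

    `adjacency` maps each task to its neighbouring tasks. Edges are
    treated as undirected here, which is an assumption of this sketch.
    """
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < x:
        task = queue.popleft()
        for neighbour in adjacency[task]:
            if neighbour not in seen:
                seen.add(neighbour)
                visited.append(neighbour)
                queue.append(neighbour)
                if len(visited) == x:
                    break
    return visited

# Hypothetical six-task graph, used only for illustration.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": ["B", "F"],
    "E": ["C", "F"],
    "F": ["D", "E"],
}
n = len(graph)
x = random.randint(1, n - 1)   # x is drawn from 1 .. n-1
subgraph = select_subgraph(graph, "F", x)
```

The tasks returned for each parent form the sub-graph to be exchanged; the edges that leave this set are the cut edges whose communication amounts are labeled in the figure.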
Although the sub-graph crossover considers the dependency of tasks, it can be further improved by taking into account the suitability between the parents and the exchanged sub-graph. Therefore, the author presents another crossover method, which considers the communication overhead between the sub-graph and its surrounding tasks.
Second, to reduce the communication cost, it proposes a systematic rotation and reflection scheme that adjusts a shape diagram to enlarge the search space. As a simple demonstration, the shape SB can be rotated and reflected to obtain eight candidates, named SB1, SB2, SB3, and so on, as shown in Figure 2.4, where SB1 is identical to SB. These candidates are evaluated by calculating the communication cost that they cause. The communication cost is defined as Σ_i (c_i × d_i), where c_i is the input or output communication amount of the sub-graph and d_i is the Manhattan distance of that communication. After calculating the communication cost of the eight candidates, it selects the candidate with the minimum communication cost as the final shape for the exchange.
Figure 2.4: Rotation and reflection of SB for the shape crossover operation
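The candidate evaluation can be illustrated with a short Python sketch. It is not the code of [17]: the data layout (a dictionary from task to tile coordinate), the names `candidates` and `communication_cost`, the example shape and the cut edges are assumptions chosen for illustration, and the sketch does not check whether a rotated shape collides with tasks outside the sub-graph.

```python
def candidates(shape):
    """Return the eight rotations and reflections of `shape`.

    `shape` maps each task to an (x, y) tile. Every candidate is shifted
    so that its bounding box keeps the original lower-left corner
    (an assumption of this sketch).
    """
    ops = [
        lambda x, y: (x, y),     # identity
        lambda x, y: (-y, x),    # rotate by 90 degrees
        lambda x, y: (-x, -y),   # rotate by 180 degrees
        lambda x, y: (y, -x),    # rotate by 270 degrees
        lambda x, y: (-x, y),    # reflect about the vertical axis
        lambda x, y: (x, -y),    # reflect about the horizontal axis
        lambda x, y: (y, x),     # reflect about the main diagonal
        lambda x, y: (-y, -x),   # reflect about the anti-diagonal
    ]
    ox = min(x for x, _ in shape.values())
    oy = min(y for _, y in shape.values())
    result = []
    for op in ops:
        moved = {t: op(x, y) for t, (x, y) in shape.items()}
        mx = min(x for x, _ in moved.values())
        my = min(y for _, y in moved.values())
        result.append({t: (x - mx + ox, y - my + oy)
                       for t, (x, y) in moved.items()})
    return result

def communication_cost(placement, cut_edges):
    """Sum of c_i * d_i over the cut edges, with d_i a Manhattan distance."""
    total = 0
    for task, (ex, ey), amount in cut_edges:
        x, y = placement[task]
        total += amount * (abs(x - ex) + abs(y - ey))
    return total

# Hypothetical L-shaped sub-graph and two cut edges:
# (task inside the sub-graph, tile of the outside task, communication amount).
shape = {"D": (1, 1), "F": (2, 1), "E": (1, 2)}
cut_edges = [("D", (0, 1), 4), ("E", (3, 2), 2)]

best = min(candidates(shape), key=lambda p: communication_cost(p, cut_edges))
```

Here `best` plays the role of the candidate selected for the exchange. Symmetric shapes produce duplicate candidates, which is harmless because only the minimum cost matters.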
2.2 Our Design Flow
Our design flow is shown in Figure 2.5. Our methodology takes two inputs. First, an application is partitioned into communicating tasks, and the characteristics of the tasks and their data dependencies are modeled as a task graph. Second, the NoC platform contains the network architecture and heterogeneous computing resources (the task graph and NoC platform are explained in detail later). The task scheduling process determines which task is mapped to which resource. It tries not only to reduce the communication time by mapping interacting tasks onto the same resource (making the communication intra-resource) under memory constraints, but also to map each task onto the most appropriate resource to reduce its computation time. Next, the routing process [18] assigns a specific connection path to each communication between tasks. After the routing process, we analyze the system performance. If the results do not meet our requirements, we iteratively refine the application or the NoC platform and repeat task scheduling and routing until they do.
Figure 2.5: Design flow
2.3 Our NoC Platform
As shown in Figure 2.6, our NoC platform consists of switches and processing elements. Each switch connects to its neighboring switches and to a corresponding processing element, and together these components form the network architecture. Processing elements communicate with each other by passing messages through the switches of the network.
Figure 2.6: NoC platform
The architecture of our switch is shown in Figure 2.7. The switch has four ports connecting to neighboring switches and one port connecting to the local processing element. Each port is composed of an input stage and an output stage.
Figure 2.7: Switch architecture
As shown in Figure 2.8, the interface of a switch is composed of an input channel and an output channel. Each channel includes an Address-line, a Data-line and an Ack-line. The Address-line delivers the input or output address of the packet, the Data-line delivers the transmitted data, and the Ack-line feeds an acknowledgement back to the source switch or processing element to report the result of the transmission. The output channel and the input channel are complementary to each other.
Figure 2.8: Switch interface
Our platform has five features:
(1) circuit switching
(2) dedicated connection path
(3) virtual channel flow control
(4) weighted round-robin scheduling
(5) pipeline bus
Features (1) and (2) provide guaranteed bandwidth and small memory usage in the network switches. Features (3) and (4) prevent deadlock and improve the utilization of the network. Finally, feature (5) improves the performance of the network.
The details of the switch and network architecture are described in [18].
There are two kinds of processing elements in our NoC platform: processors and FPGAs. This makes the NoC platform fully programmable. A processor is clearly a programmable processing element, and an FPGA is dedicated hardware that can be reconfigured at design time. Since our platform is fully programmable, we can reduce the development cost by reusing it in many different applications without any architectural modification.
The processor is a highly flexible processing element that can execute tasks with good management. In most cases, however, a processor cannot match the performance of dedicated hardware when executing tasks with data dependency. On the other hand, dedicated hardware is not as flexible as a processor. Hence, our platform contains a second type of processing element to overcome this issue. An FPGA works like dedicated hardware, but it has the advantage of being reconfigurable at design time. Consequently, our platform can execute various tasks efficiently.
Figure 2.9 shows both the processor model and the FPGA model. Every processing element contains a network interface (NI) to communicate with its local switch. The buffer is a temporary memory used for storing input and output data when communicating with other processing elements. As mentioned before, our platform consists of two different types of processing elements. An FPGA contains an FPGA core, and a processor contains a processor core and a local memory, which stores the program and the intermediate data during execution.
Figure 2.9: Processing element model
2.3.1 Task Graph
An application can be partitioned into many tasks to exploit parallelism. Figure 2.10 shows an example task graph for an H.263 decoder. A node represents a task and its functionality. Take node C as an example: its functionality is IDCT, which performs an inverse discrete cosine transform on a frame produced by task B. An edge represents a data transmission and its communication amount. For instance, when task B has completed, it transmits c2 units of data to task C. An edge also expresses the data dependency between tasks: a task cannot be executed until it has received the data from all its predecessors. For example, task G cannot be executed until it receives c5 units of data from task D and c6 units of data from task E, which ensures the correctness of the program.
Figure 2.10: Task graph of H.263 decoder
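The dependency rule of the task graph (a task runs only after all its predecessors have delivered their data) can be written down as a small Python sketch. The class name `TaskGraph`, its methods, and the numeric amounts standing in for c2, c5 and c6 are assumptions made for illustration; the thesis gives these amounts only symbolically.

```python
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    # edges[(source, destination)] = communication amount in data units
    edges: dict = field(default_factory=dict)

    def predecessors(self, task):
        """Tasks that send data to `task`."""
        return [src for (src, dst) in self.edges if dst == task]

    def is_ready(self, task, received):
        """True when every predecessor edge of `task` is in `received`."""
        return all((src, task) in received for src in self.predecessors(task))

# Three edges of the decoder example; the numbers are placeholders
# for c2, c5 and c6 and carry no meaning of their own.
graph = TaskGraph(edges={("B", "C"): 2, ("D", "G"): 5, ("E", "G"): 6})

received = {("D", "G")}
g_ready_before = graph.is_ready("G", received)   # data from E is still missing
received.add(("E", "G"))
g_ready_after = graph.is_ready("G", received)    # both inputs have arrived
```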
In addition to the task graph, a processing element database specifies the requirements of each task when it runs on a specific processing element. As shown in Figure 2.11, the database contains the execution time of the task and its memory usage (program and intermediate data) when executing on a processor. If the task executes on an FPGA, the database gives the execution time and the capacity usage (logic) of the task.
Figure 2.11: Resource requirement of tasks (e: execution time, m: memory usage, c: capacity usage)
2.3.2 Performance Evaluation
Since the application is executed repeatedly, one run after another, we take throughput instead of execution time as the system performance metric. Take video compression as an example: we may compress an entire movie into a more compact format such as MPEG-4. A movie may contain thousands of frames, so when we evaluate the video compression ability of the system, we rate it in frames per second rather than seconds per frame. More precisely, our system performance evaluation calculates how many times the application can be performed in a given period.
2.4 Genetic Algorithms
Task scheduling tries to allocate a set of tasks to resources such that the performance is optimal. However, the problem is known to be NP-complete. Therefore, heuristic algorithms are often used to deal with the task scheduling problem [14][18][19][20][21].
Several aspects are important to system performance. First, since the NoC platform contains heterogeneous computing resources and a task may be better suited to some kinds of resources than to others, the execution time of a task depends on the resource it uses. Second, the communication time between tasks depends heavily on the communication distance. Communication time can therefore be greatly reduced by mapping communicating tasks onto the same resource. However, this may violate the constraints mentioned before, and it neglects the suitability between tasks and resources. As a result, the task scheduling problem must be solved by considering the trade-off among execution time, communication time and constraints.
Genetic algorithms are typically good at finding near-optimal solutions in a large search space. In addition, unlike many traditional optimization techniques, genetic algorithms do not require knowledge of the search space; they only need a measure of solution quality [14][22][23]. In other words, we do not need to know how to arrange the tasks to get the best performance; we only have to define how the performance of a solution is measured. As a result, genetic algorithms are well suited to task scheduling problems.
Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. They differ from traditional optimization methods in four fundamental ways [22]:
(1) Genetic algorithms use a coding of the parameter set instead of the parameters themselves.
(2) Genetic algorithms search from a population of search points rather than a single one.
(3) Genetic algorithms use an objective function, not derivatives or other auxiliary knowledge.
(4) Genetic algorithms use probabilistic transition rules, not deterministic rules.
To employ genetic algorithms to solve our problem, we first need to encode the possible solutions of the optimization problem as a set of chromosomes. Each chromosome represents a solution to the problem, and a collection of solutions forms a population. Next, we generate the initial population; its chromosomes are often generated randomly or heuristically. After that, we evaluate the fitness value of each chromosome, which measures how good the chromosome is as a solution to the problem and plays an important role in the evolution procedure.
In the evolution process, we optimize the population by using the genetic operators: selection, crossover and mutation. The genetic algorithm selects chromosomes from the current generation according to their fitness values: the higher the fitness value of a chromosome, the higher the probability that it is selected. In the crossover and mutation procedures, the next generation is generated by exploring the search space. Finally, we evaluate the fitness values of the chromosomes in the next generation and add them to the current generation. To keep the population size constant, some poor chromosomes are discarded. The result improves generation by generation until the saturation condition is met, and the solution is the best chromosome when the genetic algorithm terminates.
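The generic evolution loop described above can be sketched in Python. This is a textbook-style illustration on a toy problem (maximizing the number of ones in a bit string), not the scheduling algorithm of this thesis. The function names, the population sizes, the one-point crossover, the bit-flip mutation, and the saturation test (stop after `patience` generations without improvement) are all assumptions of the sketch.

```python
import random

def evolve(fitness, random_chromosome, crossover, mutate,
           pop_size=20, offspring_per_gen=10, patience=30, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    stale = 0
    while stale < patience:                    # assumed saturation condition
        # Selection: probability proportional to fitness.
        weights = [fitness(c) + 1e-9 for c in population]
        children = []
        for _ in range(offspring_per_gen):
            p1, p2 = rng.choices(population, weights=weights, k=2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        # Insertion: merge the offspring, then discard the worst chromosomes.
        merged = sorted(population + children, key=fitness, reverse=True)
        population = merged[:pop_size]
        if fitness(population[0]) > fitness(best):
            best, stale = population[0], 0
        else:
            stale += 1
    return best

# Toy problem: chromosomes are bit lists, fitness is the number of ones.
LENGTH = 12

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(LENGTH)]

def one_point(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def flip_one(chromosome, rng, rate=0.1):
    child = list(chromosome)
    if rng.random() < rate:
        child[rng.randrange(len(child))] ^= 1
    return child

best = evolve(sum, random_bits, one_point, flip_one)
```

For task scheduling, the chromosome would instead encode a task-to-resource mapping and the fitness would be the measured throughput; only the fitness definition has to change, which is the property that makes the approach attractive here.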
Chapter 3
Task Scheduling
In this chapter, the proposed task scheduling method is presented. The procedures of our method are explained in detail in the following sections. In addition, we discuss the partition work and the communication improvement in our algorithm in depth. We believe that the proposed algorithm can improve the scheduling result and noticeably shorten the scheduling time.
3.1 Assumption
First, we need to define the constraints and make some assumptions.
There are two ways to implement a task: software (program) or hardware (logic).
Since the local memory of a processor and the total capacity of an FPGA are limited, we must consider these two constraints. The memory constraint of a processor restricts the size of the programs and intermediate data of tasks. The capacity constraint of an FPGA is similar: it limits the total logic of the tasks assigned to the FPGA.
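A feasibility check for these two constraints might look like the following Python sketch. The function name `satisfies_constraints`, the data layout and all numbers are hypothetical; the only idea taken from the text is that the total usage of the tasks assigned to a processing element must not exceed its memory (processor) or capacity (FPGA).

```python
def satisfies_constraints(assignment, limits, usage):
    """Check the memory/capacity limit of every processing element.

    assignment: task -> processing element name
    limits:     processing element -> memory size or FPGA capacity
    usage:      (task, processing element) -> memory or logic needed there
    """
    totals = {}
    for task, pe in assignment.items():
        totals[pe] = totals.get(pe, 0) + usage[(task, pe)]
    return all(total <= limits[pe] for pe, total in totals.items())

# Hypothetical platform with one processor and one FPGA.
limits = {"cpu0": 64, "fpga0": 100}
usage = {
    ("A", "cpu0"): 30, ("A", "fpga0"): 70,
    ("B", "cpu0"): 40, ("B", "fpga0"): 50,
}

both_on_cpu = satisfies_constraints({"A": "cpu0", "B": "cpu0"}, limits, usage)
split = satisfies_constraints({"A": "cpu0", "B": "fpga0"}, limits, usage)
```

In this example, placing both tasks on the processor needs 70 units of a 64-unit memory and is rejected, while splitting them across the two processing elements is accepted.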
A task needs buffers in order to be performed. For instance, in Figure 3.1, task A has to wait for 4 units of data from task B and 2 units of data from task C before it is executed, so it needs 6 units of input buffer for these input data. In addition, while task A is executing, the generated result (3 units, to be sent to task D) needs to be stored in the output buffer. Therefore, the minimum buffer requirement is 6 + 3 = 9 units.
Figure 3.1: Example of task graph
However, a buffer that holds only the sum of the input and output data is not sufficient. First, while task A is executing, the 4 units of data from task B and the 2 units of data from task C occupy the input buffer, and the output buffer must reserve 3 units for the result of task A, so the buffer is full. Thus, neither task B nor task C can transmit new data to task A until the job is finished, which prevents the system from executing continuously. Second, if the output buffer of task A is full (task D has not finished receiving data from task A), task A has to wait until the output buffer is cleared, which means that it idles during this period. To overcome this problem, we set the buffer requirement to 18 units, twice the minimum buffer requirement. The task can then receive and transmit data whether or not it is executing, which can greatly improve the system performance.
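The buffer arithmetic for task A can be restated as a few lines of Python. The function name `minimum_buffer` is introduced only for this illustration; the numbers are those of the example in Figure 3.1.

```python
def minimum_buffer(input_units, output_units):
    """Buffer needed to hold one set of inputs plus one set of outputs."""
    return sum(input_units) + sum(output_units)

# Task A: 4 units from task B, 2 units from task C, 3 units to task D.
single = minimum_buffer([4, 2], [3])   # 6 + 3
doubled = 2 * single                   # requirement used so that transfers
                                       # can overlap with execution
```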
We do not need to decide the execution order when two or more tasks are allocated to the same FPGA, because the tasks are implemented in different parts of the FPGA and none of them share the same component. When two or more tasks are allocated to the same processor, however, we have to decide the execution order dynamically. Deciding the execution order in advance is not appropriate because of the dynamic behavior of communication. In addition, a task has to wait for all its input data before it is executed.
For these reasons, we choose a dynamic First In First Serve (FIFS) strategy to determine the execution order of tasks. It has two advantages:
(1) It is flexible enough to cope with the uncertainty of the network.
(2) It raises the utilization of the processor by considering the data availability of tasks [24].
The FIFS strategy is implemented as a queue. A task is pushed into the queue when all its input data are ready and the output buffer size is sufficient. The processor then executes the tasks in queue order.
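The queue behaviour can be sketched in Python as follows. This is a simplified illustration of the FIFS idea for a single processor and a single run of each task, not the implementation used in this thesis. The class name `FifsQueue`, its methods, the way output space is reserved, and the hypothetical task E in the example are all assumptions.

```python
from collections import deque

class FifsQueue:
    """Ready queue of one processor: tasks run in the order they become ready."""

    def __init__(self, required_inputs, output_size, free_output):
        self.required = required_inputs    # task -> set of predecessor tasks
        self.output_size = output_size     # task -> units of output data
        self.free_output = free_output     # free units in the output buffer
        self.arrived = {task: set() for task in required_inputs}
        self.waiting = set(required_inputs)  # tasks not yet queued
        self.queue = deque()

    def _try_enqueue(self, task):
        if (task in self.waiting
                and self.arrived[task] >= self.required[task]
                and self.output_size[task] <= self.free_output):
            self.free_output -= self.output_size[task]  # reserve output space
            self.waiting.remove(task)
            self.queue.append(task)

    def data_arrived(self, task, source):
        """Record that `source` has delivered its data to `task`."""
        self.arrived[task].add(source)
        self._try_enqueue(task)

    def output_released(self, units):
        """Free output space, then re-check the tasks that were blocked.

        Blocked tasks are re-checked in name order, an arbitrary
        tie-break chosen for this sketch.
        """
        self.free_output += units
        for task in sorted(self.waiting):
            self._try_enqueue(task)

    def next_task(self):
        return self.queue.popleft() if self.queue else None

# Task A waits for B and C; the hypothetical task E waits only for B.
fifs = FifsQueue(
    required_inputs={"A": {"B", "C"}, "E": {"B"}},
    output_size={"A": 3, "E": 2},
    free_output=6,
)
fifs.data_arrived("A", "B")   # A still misses the data from C
fifs.data_arrived("E", "B")   # E becomes ready first
fifs.data_arrived("A", "C")   # A becomes ready second
order = [fifs.next_task(), fifs.next_task(), fifs.next_task()]
```

Task E is queued before task A because its data are complete first, even though data for A started to arrive earlier; this is the sense in which the order follows data availability rather than a fixed schedule.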
3.2 Problem Formulation
The task scheduling problem can be formulated as follows:
Given:
(1) A task graph G(V,E) with communication and computation information
(2) An NoC platform with the following characteristics:
(a) mesh size
(b) memory size of processor
(c) capacity size of FPGA
(d) buffer size of processing element
(e) communication bandwidth of each channel
Goal:
Use our algorithm to schedule each task efficiently so that system throughput is maximized.
3.3 GA-based Task Scheduling Flow
The proposed GA-based task scheduling flow is illustrated in Figure 3.2. It includes the following procedures: initial population generation and evolution, which is composed of selection, crossover, mutation, simulation and insertion. First, we generate the initial