Chapter 1 Introduction
1.4 Thesis organization
The rest of this thesis is organized as follows. Chapter 2 introduces related work and our design flow. Chapter 3 presents the task scheduling method using genetic algorithms. The experimental results are given and discussed in Chapter 4. Finally, the conclusions and future work are described in Chapter 5.
Chapter 2 Preliminary
2.1 Related works
There has been much research in the NoC domain. By borrowing models, techniques, and tools from the networking field and applying them to SoC design, the authors of [2] propose a layered-micronetwork design methodology to address future SoC designs, as shown in Figure 2. In this vertical design flow, every layer is specialized and optimized for the target application domain. In [7], an on-chip interconnection network is used to replace the ad-hoc global wiring structure. The structured network wiring gives well-controlled electrical parameters that eliminate timing iterations and enable the use of high-performance circuits to reduce latency and increase bandwidth.
Figure 2 : Layered-micronetwork design methodology
Recently, several NoC platforms and architectures have been proposed in [9][5][10][3]. [9] proposes a packet-switched NoC platform, which includes both an architecture and a design methodology. The architecture is an m × n mesh of switches to which computing resources such as processor cores, memories, FPGAs, custom hardware, or any other Intellectual Property (IP) blocks are connected. This work covers both the choice of the NoC architecture and the process of mapping the application onto the architecture. The Scalable Programmable Integrated Network (SPIN) is a regular, fat-tree-based network architecture [5] that uses wormhole routing to reduce both the storage requirement of the network switches and the latency of messages. A circuit-switched two-dimensional mesh network called SoCBUS is proposed in [10]. [10] introduces the concept of packet connected circuit (PCC), where a packet is switched through the network, locking the circuit as it goes. PCC is similar to circuit switching, which has the advantages of guaranteed bandwidth and freedom from deadlock.
Several synthesis techniques are proposed in [3][11][12][13]. [3] presents Xpipes, which consists of a library of soft macros (switches, network interfaces, and links) so that domain-specific heterogeneous architectures can be instantiated and synthesized. Xpipes provides a tool called XpipeCompiler, which can automatically instantiate a customized NoC from the library of soft network components. More precisely, the designer uses the Xpipes library to describe the network architecture. The information on the network architecture is then specified in an input file for the XpipeCompiler. The tool generates a hierarchical SystemC description of the whole system, which can then be compiled and simulated at the cycle-accurate and signal-accurate level.
[11] presents an algorithm that automatically maps IP cores onto a generic regular NoC architecture. The algorithm is based on the branch-and-bound technique and solves the mapping problem so as to minimize communication energy consumption under performance constraints.
[12] presents the NMAP algorithm, which maps cores onto an NoC architecture under bandwidth constraints. NMAP can be applied to both single-path routing and split-traffic routing. In [13], the author uses a simple packet-switching communication model to estimate the communication time and proposes a two-step genetic algorithm that maps a parameterized task graph onto a 2D-mesh NoC architecture so as to minimize the overall execution time of the task graph.
2.2 Our design flow
Figure 3 depicts our design flow. Our methodology takes two inputs. First, an application is partitioned into communicating tasks, and the characteristics of the tasks and their dependencies are modeled as a task graph. Second, the NoC platform contains the network architecture and heterogeneous computing resources (the task graph and the NoC platform are explained in detail in 2.2.1 and 2.2.2). The task scheduling process decides which task is mapped onto which resource. The process not only tries to reduce the communication time by mapping interacting tasks onto the same resource (making their communication intra-resource) under memory constraints, but also tries to map each task onto the most appropriate resource to improve its computation time. Next, the routing process [14] assigns a dedicated connection path to each communication between tasks. After the routing process, we conduct a system performance analysis. If the results do not meet our requirements, we iteratively refine the application or the NoC platform and repeat task scheduling and routing until the results satisfy our requirements.
[Figure 3 flowchart: Application → Task Graph; Task Graph and NoC Platform → Task Scheduling → Routing → Performance Analysis → Good? If yes, Finish; if no, Refinement of the application or the NoC platform and repeat.]
Figure 3 : Design flow
2.3 Our NoC platform
Our NoC platform is shown in Figure 4. It contains a network architecture constructed by switches, and each switch connects to a corresponding processing element.
[Figure 4 diagram: a mesh of switches, each switch connected to a processing element (PE).]
The processing elements communicate with each other by passing messages through the switches of the network. Our network and switches have the following five features:
(1) circuit switching,
(2) dedicated connection path,
(3) virtual channel flow control,
(4) weighted round-robin scheduling, and
(5) pipeline bus.
Features (1) and (2) provide guaranteed bandwidth and keep the memory usage of the network switches small. Features (3) and (4) prevent deadlock and improve the utilization of the network. Finally, feature (5) improves the performance of the network. The details of the switch and the network architecture are described in [14].
Our NoC platform contains two types of processing elements: processors and FPGAs. This makes our NoC platform a fully programmable platform. The FPGA is dedicated hardware that can be reconfigured at run time. With a fully programmable platform, we can reduce development cost by reusing the platform for many different applications (different applications with different configurations) without any architectural modification.
The processor is a highly flexible processing element. It is good at executing control-oriented tasks. In most cases, however, a processor cannot match the performance of dedicated hardware on datapath-oriented tasks. Conversely, dedicated hardware cannot be as flexible as a processor. Therefore, our platform contains another type of processing element, the FPGA, to overcome this issue. An FPGA works like dedicated hardware but has the advantage of being reconfigurable at run time. Consequently, our platform is capable of executing various tasks efficiently.
Figure 5 shows the architectural model of our processing elements. Each processing element contains a network interface to communicate with the local switch. The buffer is a temporary memory that stores the input data from other processing elements and the output data to other processing elements. As mentioned before, our platform consists of two different types of processing elements. The FPGA element contains an FPGA core. The processor element contains a processor core and a local memory, which stores the program and the intermediate data while a task is executing.
[Figure 5 diagram: the FPGA element consists of a network interface (NI), a buffer, and an FPGA; the processor element consists of an NI, a buffer, a processor, and memory.]
Figure 5 : Processing element model
2.3.1 Task graph
Owing to their parallelism, applications can be partitioned into many communicating tasks. Figure 6 shows a task graph of an H.263 decoder [15]. A vertex represents a task, and its functionality is also shown in the vertex. For example, task C is IDCT which [...] task B being completed, it transmits c2 units of data to task C. An edge also indicates a data dependency: a task cannot be executed until it has received the data from its predecessors. For example, task G cannot be executed until it receives c5 units of data from task D and c6 units of data from task E.
[Figure 6 graph: tasks A (VLD), B (IQ), C (IDCT), D (R I-f), E (R P-f), and G (out); edges labeled c1 to c6.]
Figure 6 : Task graph of H.263 decoder[15]
In addition to the task graph, a processing element database specifies the details of each task when it is executed on a specific processing element. As shown in Figure 7, for a task executed on a processor, the database gives the execution time and the memory usage (program and intermediate data). For a task executed on an FPGA, it gives the execution time and the capacity usage (logic). An entry of ∞ indicates that the task cannot be executed on that processing element.
[Figure 7 table: example entry for task VLD with a Processor (e, m) field and a field marked (∞, ∞); legend: e = executing time, m = memory usage, c = capacity usage.]
Figure 7 : Processing element database
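As an illustration only, the processing element database could be held in a small lookup table. The following Python sketch assumes a dictionary layout that is not part of this work; the task names are taken from Figure 6, while every number, and the choice of which entry is ∞, is a made-up placeholder. `math.inf` plays the role of ∞.

```python
import math

# Hypothetical processing element database. All numbers are placeholders,
# not measurements from this thesis.
# Processor entry: (execution time e, memory usage m)
# FPGA entry:      (execution time e, capacity usage c)
PE_DATABASE = {
    "VLD":  {"processor": (40, 12), "fpga": (math.inf, math.inf)},
    "IDCT": {"processor": (90, 8),  "fpga": (15, 30)},
}

def can_execute(task, pe_type):
    """A task can run on a PE type unless its entry is infinite."""
    exec_time, _usage = PE_DATABASE[task][pe_type]
    return math.isfinite(exec_time)

def execution_time(task, pe_type):
    """Execution time of a task on the given PE type."""
    return PE_DATABASE[task][pe_type][0]
```

With such a table, a scheduler can both reject infeasible task/PE pairs and compare execution times across PE types.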
2.3.2 Performance evaluation
Since the application is executed not once but repeatedly, we take throughput, rather than the overall execution time of the application, as the system performance metric. Take video compression as an example. We may compress an entire movie into a more compact form, e.g. MPEG-4. A movie may contain thousands of frames. Therefore, when evaluating the video compression capability of a system, we rate it in frames per second rather than seconds per frame.
More precisely, our system performance evaluation calculates how many times the application (task graph) can be executed in a fixed time period.
Chapter 3 Task Scheduling
3.1 Assumption
Before we formulate our problem, it is necessary to define the constraints and make some assumptions.
A task can be implemented in software (a program) or in hardware (logic). Since the local memory of a processor and the total capacity of an FPGA are limited, a processor cannot store an unlimited number of tasks and an FPGA cannot implement an unlimited number of tasks. Therefore, two constraints must be considered. The first, the memory constraint of a processor, means that the total size of the programs and intermediate data of the tasks stored in a processor cannot exceed the memory size of that processor. The second, the capacity constraint of an FPGA, means that the total logic of the tasks implemented on an FPGA cannot exceed the maximum capacity of that FPGA.
Executing a task requires some buffer space. For example, as shown in Figure 8, task A cannot be executed until it has received 4 units of data from task B and 2 units of data from task C. Task A needs a 6-unit buffer to store these input data temporarily, so the minimum input buffer requirement of task A is 6 units. Similarly, while task A is executing, the generated output data must be stored in the output buffer, so the minimum output buffer requirement is 3 units. In total, the minimum buffer requirement is 9 (6 + 3) units.
[Figure 8 graph: B → A (4 units), C → A (2 units), A → D (3 units).]
Figure 8 : Task graph example
However, a buffer that holds only the sum of the input and output data is inefficient, for two reasons. First, while task A is executing, the input buffer holds 4 units of data from task B and 2 units of data from task C, and the output buffer must reserve 3 units for task A, which means that the buffer is full. Consequently, neither task B nor task C can transmit data to task A until task A finishes executing. This prevents task A from being executed continuously. Second, if task A is ready to execute but the output buffer is full (task D has not started or has not finished receiving data from task A), task A may idle until the output buffer is cleared. The system performance is therefore degraded if the buffer size is only the minimum buffer requirement. If instead we set the buffer size to 18 units (twice the minimum buffer requirement), the buffer works like a ping-pong buffer: the task can receive or transmit data whether or not it is executing, which greatly improves the system performance. Hence, the reasonable buffer requirement is set to twice the sum of the input data and output data.
As with the memory and capacity constraints, the sum of the reasonable buffer requirements of the tasks implemented on the same PE cannot exceed the buffer capacity of that PE.
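The buffer arithmetic above can be written out as a short Python sketch. The function names are chosen here for illustration only; the numbers are those of task A in Figure 8.

```python
def minimum_buffer(inputs, outputs):
    """Sum of the input and output data units of one task."""
    return sum(inputs) + sum(outputs)

def reasonable_buffer(inputs, outputs):
    """Twice the minimum, so the buffer can work in ping-pong fashion."""
    return 2 * minimum_buffer(inputs, outputs)

def fits_buffer(tasks_on_pe, pe_buffer_size):
    """Buffer constraint: the reasonable requirements of all tasks on one
    PE must not exceed the buffer size of that PE."""
    total = sum(reasonable_buffer(i, o) for i, o in tasks_on_pe)
    return total <= pe_buffer_size

# Task A of Figure 8: 4 units from B, 2 units from C, 3 units to D.
task_a = ([4, 2], [3])
print(minimum_buffer(*task_a))     # 9
print(reasonable_buffer(*task_a))  # 18
```

For example, a PE with a 20-unit buffer could host task A alone, but not two tasks with the same requirement.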
If two or more tasks are implemented on the same FPGA, it is unnecessary to decide their execution order: since the tasks are implemented in different parts of the FPGA, none of them share the same FPGA components.
If two or more tasks are implemented on the same processor, we decide their execution order dynamically. Because of the dynamic behavior of communication in the on-chip network, it is not suitable to fix the execution order of the tasks at design time. In addition, the application is represented as a task graph (dataflow graph) in which a task is never executed until its input data arrive. For these two reasons, it is suitable to use a dynamic First In First Serve (FIFS) strategy to decide the execution order of tasks. It is not only flexible enough to cope with the uncertainty of the network, but also takes the data availability of tasks into account to raise the utilization of the processor [15].
A FIFS strategy is implemented as a queue. If all the input data are available and output buffer size is enough, we push this task into the queue. The processor executes the tasks sequentially in order. The FIFS strategy can be further improved by either considering the data dependency or replacing it by other algorithms.
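A minimal sketch of such a queue in Python is given below. The class and method names are hypothetical, and the two readiness flags are assumed to be computed elsewhere (for example, by the buffer logic of the processing element).

```python
from collections import deque

class FifsProcessor:
    """Sketch of a First In First Serve ready queue for one processor."""

    def __init__(self):
        self.ready = deque()

    def notify(self, task, inputs_available, output_space_enough):
        # A task enters the queue only when all of its input data have
        # arrived and its output buffer has enough room.
        if inputs_available and output_space_enough and task not in self.ready:
            self.ready.append(task)

    def run_next(self):
        # Tasks are executed sequentially in the order they became ready.
        return self.ready.popleft() if self.ready else None

processor = FifsProcessor()
processor.notify("B", True, True)
processor.notify("A", True, False)   # output buffer full: not queued
processor.notify("C", True, True)
```

In this usage, task B is served before task C, and task A stays out of the queue until its output buffer has room.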
3.2 Problem formulation
The task scheduling problem can be formulated as follows.
Given :
(1) A task graph G(V, E) and the corresponding processing element database.
(2) An NoC platform which has the following characteristics:
(a) mesh size,
(b) local memory size of processor,
(c) total capacity size of FPGA,
(d) buffer size of processing element,
(e) communication bandwidth of each channel,
with :
(1) memory and capacity constraints, and
(2) buffer constraint.
Determine :
The allocation of each task such that system throughput is maximized.
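To make the constraints of this formulation concrete, the following Python sketch checks whether a candidate allocation is feasible. The data layout (`PE`, `allocation`, `database`, `buffer_need`) is an assumption made for illustration and is not the representation used in this work; throughput evaluation is deliberately left out.

```python
import math
from dataclasses import dataclass

@dataclass
class PE:
    kind: str           # "processor" or "fpga"
    limit: float        # local memory size (processor) or total capacity (FPGA)
    buffer_size: float  # buffer size of the processing element

def is_feasible(allocation, pes, database, buffer_need):
    """Check the memory/capacity and buffer constraints of one allocation.

    allocation:  task -> PE id
    database:    task -> {kind: (execution time, memory or capacity usage)}
    buffer_need: task -> reasonable buffer requirement
    """
    used = {pe_id: 0.0 for pe_id in pes}
    buffered = {pe_id: 0.0 for pe_id in pes}
    for task, pe_id in allocation.items():
        exec_time, usage = database[task][pes[pe_id].kind]
        if math.isinf(exec_time):   # task cannot run on this PE type
            return False
        used[pe_id] += usage
        buffered[pe_id] += buffer_need[task]
    return all(used[p] <= pes[p].limit and buffered[p] <= pes[p].buffer_size
               for p in pes)
```

A scheduler would call such a check on every candidate allocation before evaluating its throughput.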
3.3 Genetic algorithms
Basically, task scheduling allocates a set of tasks to resources such that the performance is optimal. However, the problem is known to be NP-complete. Thus, the task scheduling problem is often handled by heuristic algorithms [8][11][13][14][17].
Moreover, several important facets influence the system performance. First, the NoC platform contains heterogeneous computing resources; for example, a task may be better suited to a processor than to an FPGA. Therefore, the execution time of a task depends on the resource it uses. Second, the communication time between tasks depends strongly on the distance between their resources. The communication time can be greatly improved by mapping communicating tasks onto the same resource. However, this may violate the constraints mentioned before, and it does not take the suitability of tasks for resources into account. As a result, the task scheduling problem involves a trade-off among execution time, communication time, and constraints.
Typically, genetic algorithms (GAs) perform well at finding near-optimal solutions in a large search space. Also, unlike many traditional optimization techniques, genetic algorithms do not require knowledge of the search space but need only a measure of the quality of a solution [13][18][19]. Consequently, genetic algorithms are quite suitable for the task scheduling problem.
GAs are search algorithms based on the mechanics of natural selection and natural genetics. GAs differ from other traditional optimization methods in four fundamental ways [18] :
(1) GAs work with a coding of the parameter set, not the parameters themselves.
(2) GAs search from a population of points, not a single point.
(3) GAs use payoff (objective function) information, not derivatives or other auxiliary knowledge.
(4) GAs use probabilistic transition rules, not deterministic rules.
The first step in employing GAs is to encode the possible solutions of the optimization problem as a set of chromosomes (the encoding scheme may differ from problem to problem; the simplest way is to encode a solution as a string). Each chromosome represents a solution to the problem, and a set of solutions is referred to as a population.
The next step is to generate an initial population. The chromosomes in the initial population are often generated randomly or heuristically. The initial population is also called the first generation of the evolution. Then, it is necessary to evaluate the fitness of the chromosomes, where the fitness value represents how good (fit) the chromosome is to the problem (environment).
Next, the GAs perform the evolution process to optimize the population generation by generation using the genetic operators: selection, mating, and mutation. During the evolution process, the GAs select chromosomes from the current generation according to their fitness values: the higher the fitness of a chromosome, the higher the probability that it is selected. By applying mating and mutation to the selected chromosomes, the next generation is produced. The chromosomes in the next generation are evaluated to obtain their fitness values, and the next generation is then added to the current generation. Some bad chromosomes in the population may be discarded to keep the population at a fixed size.
Finally, the GAs continue the evolution process until the termination condition has been met. When the GAs terminate, the best chromosome is the final result for the problem.
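The generic procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in this work: one-point crossover as the mating operator, a fixed number of generations as the termination condition, all parameter values, and the toy fitness function (the number of ones in a bit string) are assumptions.

```python
import random

def evolve(fitness, random_chromosome, genes, pop_size=20, generations=50,
           mutation_rate=0.05, rng=None):
    """Generic GA loop; fitness must return a non-negative number."""
    rng = rng or random.Random()
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: higher fitness means higher selection probability.
        weights = [fitness(c) + 1e-9 for c in population]
        parents = rng.choices(population, weights=weights, k=pop_size)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            cut = rng.randrange(1, len(a))      # mating: one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Mutation: occasionally replace a gene with a random value.
                child = [rng.choice(genes) if rng.random() < mutation_rate else g
                         for g in child]
                offspring.append(child)
        # Add the next generation to the current one, then discard the worst
        # chromosomes to keep a fixed-size population.
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:pop_size]
    return population[0]

# Toy usage: maximize the number of ones in a 16-bit chromosome.
best = evolve(fitness=sum,
              random_chromosome=lambda r: [r.choice([0, 1]) for _ in range(16)],
              genes=[0, 1],
              rng=random.Random(0))
```

Because the merged population is truncated by fitness, the best chromosome found so far is never lost between generations.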
3.4 GA-based task scheduling flow
The GA-based task scheduling flow is illustrated in Figure 9. First, we generate an initial population. Next, the evolution process tries to explore the search space until it reaches the termination condition. Finally, the best chromosome in the population is our solution.
[Figure 9 flowchart: Initial Population → Evolution → Saturation? If no, back to Evolution; if yes, Finish.]
Figure 9 : Task scheduling flow
3.5 Initial population
For the initial population, each chromosome is generated using a meta-random scheme, which is divided into two steps:
(1) The tasks in the task graph are sorted in topological order.
(2) The tasks are mapped onto the NoC platform sequentially in this order.
During step 2, we must consider three cases:
(a) If the task has no predecessor, the task is mapped randomly.
(b) If the task has only one predecessor, the task is mapped according to the allocation of its predecessor.
(c) If the task has two or more predecessors, the task is mapped according to not only the allocations of its predecessors but also the communication amount between the task and each predecessor.
Take the task graph in Figure 10 as an example. First, we perform a topological sort on the task graph, which gives the order A, B, C, D. Next, task A is randomly mapped onto the NoC platform. Then, task B and task C are mapped according to the allocation of task A. As shown in Figure 11, task B and task C have a higher probability of being mapped onto allocations close to task A. Finally, task D is mapped according to the allocations of task B and task C. Edge B→D and edge C→D carry different communication amounts, so the probability should be higher for allocations near task B than for those near task C. Figure 11 shows how to calculate the probability.
[Figure 10 graph: A → B, A → C, B → D, C → D.]
Figure 10 : Task graph example
[Figure 11 table: columns Allocation, Probability, and sequence, relative to the allocation of task A.]
Figure 11 : Generate an initial solution
The constraints must also be considered while generating a chromosome: a task cannot be assigned to an allocation where a constraint would be violated. The initial population consists of a fixed number of chromosomes generated by the meta-random scheme. Then, the fitness value of each chromosome is evaluated.
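The meta-random scheme can be sketched in Python as below. The probability calculation of Figure 11 is not reproduced here; instead, the sketch assumes a weight of communication amount divided by (1 + Manhattan distance) for each predecessor, which is only one plausible choice. The function name, the `allowed` callback that stands in for the constraint check, and the communication amounts in the usage example are likewise hypothetical.

```python
import random
from graphlib import TopologicalSorter

def meta_random_chromosome(preds, mesh, allowed, rng):
    """preds: task -> {predecessor: communication amount}
    mesh:  list of (x, y) positions of processing elements
    allowed(task, pos, mapping) -> bool encodes the constraints
    """
    mapping = {}
    # Step 1: topological order (predecessors first).
    order = TopologicalSorter(
        {task: set(p) for task, p in preds.items()}).static_order()
    # Step 2: map the tasks sequentially in that order.
    for task in order:
        candidates = [pos for pos in mesh if allowed(task, pos, mapping)]
        if not candidates:
            raise ValueError(f"no feasible allocation for task {task}")
        weights = []
        for x, y in candidates:
            if not preds.get(task):
                weights.append(1.0)   # no predecessor: uniform choice
            else:
                # Assumed weighting: being closer to a predecessor and
                # exchanging more data with it both raise the probability.
                weights.append(sum(
                    amount / (1 + abs(x - mapping[p][0]) + abs(y - mapping[p][1]))
                    for p, amount in preds[task].items()))
        mapping[task] = rng.choices(candidates, weights=weights, k=1)[0]
    return mapping

# Task graph of Figure 10 with made-up communication amounts, on a 3 x 3 mesh.
preds = {"A": {}, "B": {"A": 1}, "C": {"A": 1}, "D": {"B": 3, "C": 1}}
mesh = [(x, y) for x in range(3) for y in range(3)]
chromosome = meta_random_chromosome(preds, mesh, lambda t, pos, m: True,
                                    random.Random(1))
```

Since every feasible position keeps a non-zero weight, distant allocations remain possible, which preserves diversity in the initial population.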
There are two reasons why we use a meta-random scheme to generate a chromosome. First, a purely random scheme may produce chromosomes with very poor performance. Second, the diversity of the chromosomes in the initial population should be kept as high as possible so that the GAs have a higher probability of exploring a larger part of the search space. For these two reasons, we use a meta-random scheme that addresses both the performance issue and the diversity issue.