Genetic Algorithms - A Fast Task Scheduling Method Based on Genetic Algorithms for Heterogeneous Network-on-Chip Systems

Chapter 2 Preliminary

2.4 Genetic Algorithms

Task scheduling allocates a set of tasks to resources so that performance is optimized. However, the problem is known to be NP-complete. Therefore, heuristic algorithms are often used to solve the task scheduling problem [14][18][19][20][21].

Several aspects are important when considering system performance. First, the NoC platform contains heterogeneous computing resources, and a task may be better suited to some kinds of resources than to others, so the execution time of a task depends on the resource it uses. Second, the communication time between tasks depends strongly on the communication distance. Communication time can therefore be greatly reduced by mapping communicating tasks onto the same resource. However, this may violate the constraints mentioned earlier, and it neglects the suitability of tasks for resources. As a result, the task scheduling problem must be solved by considering the trade-off among execution time, communication time, and constraints.

Genetic algorithms are typically good at finding near-optimal solutions in a large search space. In addition, unlike many traditional optimization techniques, genetic algorithms do not require knowledge of the search space; they only need a measure of solution quality [14][22][23]. In other words, we do not need to know how to arrange the tasks to get the best performance; we only have to define how the performance of a solution is measured. As a result, genetic algorithms are well suited to task scheduling problems.

Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. They differ from traditional optimization methods in four fundamental ways [22]:

(1) Genetic algorithms use a coding of the parameter set instead of parameters themselves.

(2) Genetic algorithms search from a population of search nodes rather than a single one.

(3) Genetic algorithms use an objective function, not derivatives or other auxiliary knowledge.

(4) Genetic algorithms use probabilistic transition rules, not deterministic rules.

To employ genetic algorithms to solve our problem, we first need to encode the possible solutions of the optimization problem as a set of chromosomes. Each chromosome represents a solution to the problem, and a collection of solutions forms a population. Next, we generate the initial population; its chromosomes are often generated randomly or heuristically. After that, we evaluate the fitness value of each chromosome, which measures how good the chromosome is as a solution to the problem and plays an important role in the evolution procedure.

In the evolution process, we optimize the population using three genetic operators: selection, crossover, and mutation. The genetic algorithm selects chromosomes from the current generation according to their fitness values: the higher the fitness value of a chromosome, the higher the probability that it is selected. In the crossover and mutation procedures, the next generation is generated by exploring the search space. Finally, we evaluate the fitness values of the chromosomes in the next generation and add them to the current generation. To keep the population size constant, some poor chromosomes are discarded. The best result improves generation by generation until the saturation condition is met, and the solution is given by the best chromosome when the genetic algorithm terminates.
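The evolution loop described above can be sketched roughly as follows. This is only a toy illustration, not the scheduler itself: the bit-string chromosome, the fitness function (the number of 1-bits), the one-point crossover, the fixed generation count used in place of a saturation condition, and all parameter values are assumptions made for the example.

```python
import random

def run_ga(fitness, chrom_len=16, pop_size=20, generations=50,
           mutation_rate=0.05, seed=0):
    """Toy GA: select, cross over, mutate, insert, then discard the worst."""
    rng = random.Random(seed)
    # Initial population: random bit strings (a heuristic could be used instead).
    pop = [[rng.randint(0, 1) for _ in range(chrom_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: probability proportional to fitness (+1 avoids all-zero weights).
        weights = [fitness(c) + 1 for c in pop]
        children = []
        for _ in range(pop_size // 2):
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, chrom_len)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each gene with a small probability.
            child = [g ^ 1 if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        # Insertion: add the children, then keep only the best pop_size chromosomes.
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return pop[0]

best = run_ga(fitness=sum)
```

Because the children are merged into the current population before the worst chromosomes are discarded, the best chromosome found so far is never lost between generations.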

Chapter 3 Task Scheduling

In this chapter, the proposed task scheduling method is presented. The procedures of our method are explained in detail in the following sections. In addition, we discuss the partitioning and the communication improvements in our algorithm in depth. We believe that the proposed algorithm can improve the scheduling result and noticeably shorten the scheduling time.

3.1 Assumption

First, we need to define the constraints and make some assumptions.

There are two ways to implement a task: software (program) or hardware (logic).

Since the local memory of a processor and the total capacity of an FPGA are limited, we must consider these two constraints. The memory constraint of a processor restricts the size of the programs and intermediate data of its tasks. The capacity constraint of an FPGA is similar: it limits the total logic of the tasks assigned to that FPGA.

A task needs buffers in order to run. For instance, in Figure 3.1, before task A is executed, it has to wait for 4 units of data from task B and 2 units of data from task C. In total, task A needs 6 units of input buffer for these input data. In addition, while task A is executing, the generated result, which takes 3 units, needs to be stored in the output buffer. Therefore, the minimum buffer requirement is 9 units.

Figure 3.1：Example of task graph (task A receives 4 units from task B and 2 units from task C, and sends 3 units to task D)

However, a buffer that holds only the sum of the input and output data is not sufficient. First, when task A is executing, it has received 4 units of data from task B and 2 units of data from task C and stored them in the input buffer. Moreover, the output buffer must reserve 3 units for the result of task A, so these requirements fill the buffer. Thus, neither task B nor task C can transmit data to task A until the current execution is finished, which prevents the system from executing continuously. Second, if the output buffer of task A is full (task D has not finished receiving data from task A), task A has to wait until the output buffer is cleared, which means that it idles during this period. Hence, to overcome this problem, we set the buffer requirement to 18 units, twice the minimum buffer requirement. The task can then receive and transmit data whether or not it is executing, which can greatly improve system performance.
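The buffer arithmetic for task A can be checked with a few lines of code. The function name and the `double_buffered` flag are our own illustrative choices and do not appear in the original method.

```python
def buffer_units(input_units, output_units, double_buffered=True):
    """Buffer needed by one task: all inputs plus the output, doubled if requested."""
    single = sum(input_units) + output_units
    return 2 * single if double_buffered else single

# Task A in Figure 3.1: 4 units from B, 2 units from C, 3 units of output for D.
minimum = buffer_units([4, 2], 3, double_buffered=False)
chosen = buffer_units([4, 2], 3)
```

Here `minimum` is the 9-unit requirement and `chosen` is the 18-unit requirement adopted in the text.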

We do not need to decide the execution order when more than one task is allocated to the same FPGA, because the tasks are implemented in different parts of the FPGA and none of them share a component. When more than one task is allocated to the same processor, however, the situation is different: we have to decide the execution order dynamically. Deciding the execution order in advance is not suitable because of the dynamic behavior of communication. In addition, a task has to wait for all of its input data before it can be executed.

For these reasons, we choose a dynamic First In First Serve (FIFS) strategy to determine the execution order of tasks. It has two advantages:

(1) It is flexible enough to cope with the uncertainty of the network.

(2) It raises processor utilization by taking the data availability of tasks into account [24].

The FIFS strategy is implemented as a queue. A task is pushed into the queue when all of its input data are ready and the output buffer has sufficient space. The processor then executes the tasks in queue order.
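A minimal sketch of such a queue is given below. The class and method names are hypothetical, and the sketch simplifies the output-buffer check to a flag passed in by the caller; a full model would also re-check a task when buffer space is released.

```python
from collections import deque

class FifsQueue:
    """Tasks enter the queue in the order in which they become ready."""

    def __init__(self, required_inputs):
        # required_inputs: task name -> set of producer tasks it waits for
        self.missing = {t: set(src) for t, src in required_inputs.items()}
        self.queue = deque()

    def data_arrived(self, task, producer, output_buffer_free=True):
        """Record one arrival; enqueue the task once all of its inputs are present."""
        self.missing[task].discard(producer)
        if not self.missing[task] and output_buffer_free and task not in self.queue:
            self.queue.append(task)

    def next_task(self):
        """Return the task that became ready first, or None if none is ready."""
        return self.queue.popleft() if self.queue else None

q = FifsQueue({"A": {"B", "C"}, "E": {"B"}})
q.data_arrived("E", "B")   # E is ready immediately
q.data_arrived("A", "B")   # A still waits for C
q.data_arrived("A", "C")   # now A is ready
order = [q.next_task(), q.next_task(), q.next_task()]
```

Task E runs before task A even though A was declared first, because the order is determined by data availability rather than fixed in advance.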

3.2 Problem Formulation

The task scheduling problem can be formulated as follows:

Given:

(1) A task graph G(V,E) with communication and computation information

(2) An NoC platform with the following characteristics:

(a) mesh size

(b) memory size of processor

(c) capacity size of FPGA

(d) buffer size of processing element

(e) communication bandwidth of each channel

Goal:

Use our algorithm to schedule each task efficiently so as to maximize system throughput.

3.3 GA-based Task Scheduling Flow

The proposed GA-based task scheduling flow is illustrated in Figure 3.2. It consists of the initial population step followed by an evolution process composed of selection, crossover, mutation, simulation, and insertion. First, we generate the initial population. Then the evolution process explores the search space until it reaches the saturation condition we set. Finally, the best chromosome in the population is our solution.

Figure 3.2：GA-based task-scheduling flow

3.4 Initial Population

We use a meta-random scheme to generate the initial population of chromosomes. It is divided into two steps:

(1) We use a topological sorting scheme to sort the tasks in the task graph.

(2) The tasks are mapped onto the NoC platform in this order.

In step 2, we must consider three cases:

(a) If the task has no predecessor, the task is mapped randomly.

(b) If the task has only one predecessor, the task is mapped according to the allocation of its predecessor.

(c) If the task has more than one predecessor, the task is mapped according to the allocations of its predecessors and the communication amount between the task and each predecessor.

We take Figure 3.3 as an example to demonstrate the initial population process; the mapping steps are illustrated in Figure 3.5. First, a topological sort is performed on the task graph, giving the order A, B, C, D. Next, task A is randomly mapped onto the NoC platform. Then, task B and task C are mapped according to the allocation of task A. Compared with task D, task B and task C have a higher probability of being mapped onto allocations close to task A. Finally, task D is mapped according to the allocations of task B and task C. Edge B→D and edge C→D have different communication amounts, so the probability should be higher for allocations nearer to task B than for those nearer to task C. The initial population process is summarized in Figure 3.4, where the ticket in the pseudo code [22] represents the probability.

Figure 3.3：Example of task graph

1. topological sort
2. according to the topological sort order, place the tasks.
   2.1 random place the root since it has no parent
   2.2 foreach task T in topological sort order
   {
       find free allocations (free resources) for T
       for each free allocation X
       {
           //calculate the ticket of X
           ticket = 0
           for each parent P of the task
           {
               //the nearer of the distance between the tasks, the higher of the ticket points
               //the more of the communication amount, the higher of the ticket points
               ticket += [(row_max+col_max-1)-manh_distance(X, P)] * Commu_amount(T,P)
           }
       }
       place the task T according to the location of its parent(s) using roulette wheel method [32]
   }

[22]

Figure 3.4：Pseudo code of the initial population process

Figure 3.5：Detailed demonstration of an initial solution
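The pseudo code of Figure 3.4 can be turned into a small runnable sketch. Several details are assumptions made for the example: each mesh cell holds at most one task, the topological order is passed in rather than computed, and the communication amounts in the usage example are invented, since the text does not give the values for Figure 3.3.

```python
import random

def manhattan(a, b):
    """Manhattan distance between two mesh cells given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def ticket(cell, task, parents, placement, comm, rows, cols):
    """Ticket of a free cell: nearer parents and heavier edges score higher."""
    return sum(((rows + cols - 1) - manhattan(cell, placement[p])) * comm[(p, task)]
               for p in parents)

def initial_mapping(order, parents, comm, rows, cols, rng):
    """Place tasks in topological order on a rows x cols mesh."""
    free = [(r, c) for r in range(rows) for c in range(cols)]
    placement = {}
    for task in order:
        ps = parents.get(task, [])
        if not ps:
            cell = rng.choice(free)            # a root is placed randomly
        else:
            weights = [ticket(x, task, ps, placement, comm, rows, cols) for x in free]
            cell = rng.choices(free, weights=weights, k=1)[0]   # roulette wheel
        placement[task] = cell
        free.remove(cell)                      # assumption: one task per cell
    return placement

order = ["A", "B", "C", "D"]
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 4, ("C", "D"): 2}  # invented amounts
placement = initial_mapping(order, parents, comm, rows=3, cols=3, rng=random.Random(1))
```

Since the largest Manhattan distance on the mesh is `rows + cols - 2`, every term of the ticket is at least one times the communication amount, so every free cell keeps a non-zero chance of being chosen and the scheme remains random rather than greedy.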

There are two reasons for using a meta-random scheme to generate chromosomes. First, a purely random scheme may produce very poor performance. Second, the diversity of the chromosomes in the initial population should be kept as high as possible, which in turn gives the genetic algorithm a higher probability of exploring a larger search space. As a result, a meta-random scheme is used to generate the chromosomes.

3.5 Selection

Following the principle of natural selection, a chromosome with a higher fitness value has a higher probability of producing the next generation. Hence, pairs of parents are selected from the population using the roulette wheel method. Each chromosome in the population has a roulette wheel slot whose size is proportional to its fitness value, and a chromosome is selected by spinning the roulette wheel. As illustrated in Figure 3.6, chromosome A has the largest fitness value, so it occupies the largest space on the roulette wheel. By spinning the roulette wheel repeatedly, we select the chromosomes that will mate in the next stage.

Figure 3.6：Roulette wheel method
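One common way to implement the roulette wheel is to lay the slots end to end and locate a random point among them. The sketch below follows that approach; the function name and the example fitness values are illustrative only.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_select(fitness, rng):
    """Return the index of one chromosome, chosen with probability proportional to fitness."""
    cumulative = list(accumulate(fitness))      # right edge of each slot
    spin = rng.random() * cumulative[-1]        # a point on the wheel
    # Clamp to guard against floating-point rounding at the upper edge.
    return min(bisect_right(cumulative, spin), len(fitness) - 1)

rng = random.Random(0)
fitness = [3, 2, 1]                             # chromosome 0 has the largest slot
counts = [0, 0, 0]
for _ in range(6000):
    counts[roulette_select(fitness, rng)] += 1
```

Over many spins the selection counts approach the ratio of the fitness values, while a chromosome with low fitness can still be selected occasionally.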

3.6 Crossover

3.6.1 Proposed Crossover Method

Genetic algorithms use crossover to search for locally optimal points. In nature, offspring inherit features from their parents; for instance, tall parents often have tall children. Likewise, in genetic algorithms the generated chromosomes inherit features from their parents. New applications are increasingly complex, so task graphs are becoming more complex and the number of tasks is growing. In this situation, the scheduling time becomes longer, and more time is needed to find a solution with better system performance. Hence, this is a worthwhile problem to study.

If we examine traditional crossover algorithms carefully, we find that they simply select a random part of the chromosome to exchange. When the task graph becomes larger, some parts of the graph tend to be neglected by the crossover selection, even though they may offer significant opportunities for throughput improvement. That is, such crossover algorithms cannot handle a large task graph well, and the situation becomes worse as the number of tasks grows. As a result, we need a new crossover algorithm to overcome this weakness.

Our modified crossover flow is shown in Figure 3.7. First, we divide the task graph into several partitions according to the execution order in the system, and then adjust the partition boundaries according to the communication amount. Next, we select a sub-graph in every partition and find the best shape, that is, the one with the minimum communication overhead with its surroundings [17]. To further control the communication overhead, we use a threshold value to filter the crossover in every partition: if the communication overhead is larger than this value, we do not perform the crossover. After that, we move to the next partition and repeat the procedure.

Figure 3.7：Proposed crossover flow

3.6.2 Partition

A partition genetic algorithm is presented in [16]. We use the partition method from that work to divide the task graph into several partitions and perform crossover in each partition. That work introduces the concept of blevel, which represents the execution order of the system. The partition method distributes crossover more evenly over the task graph and yields better scheduling results. However, its crossover method does not consider data dependency; it uses the traditional chromosome for crossover, which leads to poor simulation results. Therefore, we use the graphic-based chromosome [17] to solve this problem and make further improvements to obtain our own crossover algorithm.

As shown in Figure 3.8, we use the blevel value to divide the original chromosome into three partitions. Since each partition contains fewer tasks, crossover is applied more evenly within the partitions.

Figure 3.8：Partition with Blevel
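The text does not spell out how blevel is computed. The sketch below assumes the definition commonly used in the list-scheduling literature, in which the b-level of a task is the length of the longest path from that task to an exit task; here path length counts execution cost only. The graph and the costs are invented for illustration.

```python
from functools import lru_cache

def blevels(succ, cost):
    """b-level of each task: its own cost plus the largest b-level among its successors."""
    @lru_cache(maxsize=None)
    def level(task):
        return cost[task] + max((level(s) for s in succ.get(task, ())), default=0)
    return {task: level(task) for task in cost}

succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}   # edges A->B, A->C, B->D, C->D
cost = {"A": 2, "B": 3, "C": 1, "D": 2}            # invented execution times
levels = blevels(succ, cost)
by_execution_order = sorted(cost, key=levels.get, reverse=True)
```

Sorting by decreasing b-level places tasks that must run early first, so cutting the sorted list into consecutive groups gives partitions that follow the execution order.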

Given a task graph, we first analyze it to obtain the information we need. Next, we have to decide how many partitions are suitable for the task graph, since the number of partitions influences the result of crossover. If the number of partitions is too small, the scheduling result does not improve much. On the other hand, if the number of partitions is too large, the heavy communication overhead leads to a worse result. Moreover, the number of partitions affects the scheduling time of each generation: too many partitions lengthen the scheduling time. Thus, the number of partitions is very important. In our algorithm, we divide the task graph into partitions of 30 tasks each. For example, a task graph with 200 tasks has 7 partitions. We consider this a suitable partition size for our algorithm, since it allows our crossover method to run well while keeping the communication overhead manageable.
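The partition count in this example follows from a ceiling division. The sketch below assumes that the tasks are already sorted by execution order and that the leftover tasks form a final, smaller partition; the text states only the resulting count of 7.

```python
import math

def split_into_partitions(ordered_tasks, tasks_per_partition=30):
    """Cut an ordered task list into consecutive partitions of a fixed size."""
    count = math.ceil(len(ordered_tasks) / tasks_per_partition)
    return [ordered_tasks[i * tasks_per_partition:(i + 1) * tasks_per_partition]
            for i in range(count)]

partitions = split_into_partitions(list(range(200)))
sizes = [len(p) for p in partitions]
```

With 200 tasks this yields six partitions of 30 tasks and one of 20.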

After deciding the partitions, we adjust the partition boundaries of the task graph. When we perform crossover, we only pick tasks that are in the same partition. However, this crossover disturbs the allocation produced in the initial population and can cause heavy communication overhead between different partitions. We therefore adjust the boundaries to lower the communication amount between different partitions; controlling this overhead improves the crossover result.

Our strategy is shown in Figure 3.9. Task A has 4 units of communication with the next partition and 5 units of inner communication, so we keep task A in its original partition. Task B has 5 units of communication with the next partition and 6 units of inner communication, so we keep it as well. Finally, task C has 10 units of communication with the next partition and 7 units of inner communication, so we move it to the next partition. This procedure reduces the communication between different partitions from 10 to 7.

Figure 3.9：Adjust boundary
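The decision rule in this example compares the two communication amounts of each boundary task. A minimal sketch follows; the function name is hypothetical, and keeping a task when the two amounts are equal is our assumption, since the text does not cover ties.

```python
def adjust_boundary(boundary_tasks):
    """Split boundary tasks into those kept and those moved to the next partition.

    boundary_tasks maps a task name to a pair
    (communication with the next partition, communication inside its own partition).
    """
    keep, move = [], []
    for name, (cross, inner) in boundary_tasks.items():
        # Move the task only if it talks more to the next partition than to its own.
        (move if cross > inner else keep).append(name)
    return keep, move

keep, move = adjust_boundary({"A": (4, 5), "B": (5, 6), "C": (10, 7)})
```

After task C is moved, its 7 units of former inner communication become the cross-partition traffic, which is where the reduction from 10 to 7 comes from.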

3.6.3 Conditional Crossover

An NoC architecture is a communication-driven system. If we take the communication overhead into account in each crossover, we can improve the throughput obtained in every generation. Thus, we calculate the communication overhead of each crossover and use a threshold value to decide whether to perform it. In Figure 3.10, the communication overheads of the three sub-graphs are 35, 25, and 32, and we use the value 30 to filter the crossovers, so we perform the second crossover and cancel the others.

This value can be decided by experience and by the parameters of the system. To sum up, our strategy is to choose a sub-graph and find its best shape, then calculate the communication overhead to decide whether to perform the crossover. Because of the partitioning, we perform more than one crossover in each generation, so we can use this value to filter the crossovers and obtain a better result.

Figure 3.10：Conditional crossover
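The filtering step can be expressed in a few lines. The function name is ours; the overheads and the threshold are the values from the example of Figure 3.10.

```python
def crossovers_to_perform(overheads, threshold):
    """Return the indices of candidate crossovers whose overhead does not exceed the threshold."""
    return [i for i, overhead in enumerate(overheads) if overhead <= threshold]

selected = crossovers_to_perform([35, 25, 32], threshold=30)
```

Only index 1, the second sub-graph, passes the filter, which matches the example.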

3.6.4 Task Range

We use the partition technique to handle task graphs of different sizes. We choose three sizes in our experiments: 200, 300, and 400 tasks per task graph. The mesh size of the NoC architecture influences system performance. Every task range has a suitable mesh size; if we use a small mesh to run a large task graph, the system bottleneck may lie in the NoC architecture. However, our focus is on the task scheduling algorithm, so we use these three task sizes with the same 7*7 mesh (see Figure 3.11) for the NoC architecture. If the complexity of a system is close to the limit of what the NoC architecture can provide, our scheduling algorithm can achieve only a small improvement in system throughput.

Figure 3.11：Resource location (legend: Processor, FPGA)

3.6.5 Two-Step Crossover

In the genetic algorithm, the partition algorithm achieves large improvements at the start of evolution. However, once the chromosomes have improved to a certain degree, it becomes harder to obtain a better result from each crossover. To overcome this problem, we have to find a suitable crossover strategy for the later generations. Every crossover method has its own throughput curve and characteristics. Using a different crossover method in the later generations can therefore yield better improvement over the whole evolution procedure.

In our algorithm, we change the crossover frequency from once in every partition to once in the whole task graph. The idea is similar to the Simulated Annealing algorithm: once the chromosomes have improved to a certain degree, we change the crossover strategy to approach better performance gradually. We switch the crossover method at the 200th generation, since it is the threshold point of improvement in our algorithm; beyond that point, the rate of improvement declines. As a result, we use a two-step crossover method to obtain better improvement throughout the evolution procedure.
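The switch between the two steps can be sketched as a function that returns the task groups in which one crossover is performed. The function name is hypothetical, and we assume that generations are counted from 1 and that the 200th generation already uses the second step; the text does not settle this boundary case.

```python
SWITCH_GENERATION = 200   # threshold generation stated in the text

def crossover_groups(generation, partitions):
    """Return the task groups that each receive one crossover in this generation."""
    if generation < SWITCH_GENERATION:
        return partitions                                # step 1: once per partition
    return [[task for part in partitions for task in part]]   # step 2: once per graph

early = crossover_groups(1, [["A", "B"], ["C"]])
late = crossover_groups(200, [["A", "B"], ["C"]])
```

In the first step the partitions are returned unchanged, and in the second step they are merged into a single group covering the whole task graph.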

3.7 Mutation

We use the mutation operator to prevent the genetic algorithm from getting stuck at a local optimum. By randomly changing a feature of the chromosome, it gives the algorithm the opportunity to reach or approach the global optimum. The proposed mutation scheme is shown in Figure 3.12. It first selects a task randomly; the selected task then moves to a random allocation with a certain probability. This probability is chosen by the user according to experience.

Figure 3.12：Mutation operation (step 1: randomly select a task, here task E; step 2: randomly move it to a new allocation)
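The two steps of the mutation can be sketched as follows. The representation of a chromosome as a mapping from task to mesh cell and the restriction to a list of free cells are assumptions made for this example.

```python
import random

def mutate(placement, free_cells, probability, rng):
    """Pick one task at random; with the given probability move it to a random free cell."""
    mutated = dict(placement)                  # leave the parent chromosome untouched
    task = rng.choice(sorted(mutated))         # step 1: randomly select a task
    if free_cells and rng.random() < probability:
        mutated[task] = rng.choice(free_cells) # step 2: move it to a new allocation
    return mutated

parent = {"A": (0, 0), "B": (0, 1), "E": (1, 1)}
unchanged = mutate(parent, [(2, 2)], probability=0.0, rng=random.Random(0))
moved = mutate(parent, [(2, 2)], probability=1.0, rng=random.Random(0))
```

With probability 0 the chromosome is returned unchanged, and with probability 1 exactly one task is relocated.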

3.8 Simulation and Insertion

After crossover and mutating, it is necessary to evaluate the fitness value of
