
Chapter 3. Moldable Job Scheduling Without Runtime Information

3.1 Auxiliary Moldable Job Scheduling

Most current HPC workload management systems require users to specify a fixed number of processors for each job at submission time. For moldable jobs, a feasible extension of this usage model is to give the job scheduler the flexibility to change the number of processors a job actually uses, dynamically and adaptively, right before the job starts executing. This moldable job scheduling approach is auxiliary in the sense that users still provide a preferred number of processors when submitting their jobs. Some existing systems, e.g. LSF [6], and previous research, e.g. the adaptive processor allocation heuristics in [11], have adopted this approach to moldable job scheduling. In this section, we propose a new auxiliary moldable job scheduling approach that exploits information about each application’s speedup model to make more effective processor allocation decisions, and is therefore expected to outperform the more primitive previous methods [6][11].

In the greedy approach used in LSF [6], the HPC system allows users to specify a range of processor requirements, instead of a fixed number of processors, when submitting a moldable job. However, its scheduling mechanism for moldable jobs is quite primitive: it greedily allocates as many processors as possible within the range specified at submission. In [11], the authors propose and evaluate four variations of processor allocation heuristics, described below:

• No adaptive scaling. This policy allocates to each parallel job exactly the number of processors it requests.

• Adaptive scaling down. If a parallel job requests more processors than are free at scheduling time, instead of keeping the job waiting in the queue, the system automatically scales the job down to use exactly the number of free processors.

• Adaptive scaling up and down. In addition to the scaling-down mechanism of the previous policy, this policy automatically scales a parallel job up to use all free processors, even if its original request is smaller.

• Restricted scaling up and down. This is a restricted version of the previous policy. To prevent the scaling up of one job from delaying the start times of the jobs behind it, the system scales a parallel job up only if no other jobs are waiting behind it in the queue.
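The four policies of [11] differ only in how they map a job’s request and the current number of free processors to an allocation. The minimal Python sketch below assumes the decision is made for the job at the head of the queue and that a return value of 0 means the job keeps waiting; all function and parameter names are hypothetical, not taken from [11].

```python
# Hypothetical helpers: each returns the number of processors given to the job
# at the head of the queue at this scheduling point (0 means it keeps waiting).

def no_adaptive_scaling(requested: int, free: int) -> int:
    """Start the job only if its exact request fits."""
    return requested if requested <= free else 0


def adaptive_scaling_down(requested: int, free: int) -> int:
    """Scale the job down to the free processors when its request does not fit."""
    if free <= 0:
        return 0
    return min(requested, free)


def adaptive_scaling_up_and_down(requested: int, free: int) -> int:
    """Always use every free processor: scale down if too big, up if too small.

    `requested` is unused; it is kept so all policies share one signature.
    """
    return max(free, 0)


def restricted_scaling_up_and_down(requested: int, free: int, jobs_behind: int) -> int:
    """Scale down as needed; scale up only when no job waits behind this one."""
    if free <= 0:
        return 0
    if requested >= free:
        return free  # scale down (or exact fit)
    return free if jobs_behind == 0 else requested


if __name__ == "__main__":
    # Example: a 30-processor request with 20 free, then 8 requested with 64 free.
    for req, free in [(30, 20), (8, 64)]:
        print(req, free,
              no_adaptive_scaling(req, free),
              adaptive_scaling_down(req, free),
              adaptive_scaling_up_and_down(req, free),
              restricted_scaling_up_and_down(req, free, jobs_behind=2))
```

Keeping a common signature makes it easy to plug each policy into the same queue simulator and compare them.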

The experimental results in [11] indicate that, in general, the restricted scaling up and down approach achieves the best performance. Both previous methods [6][11] work in a simple and straightforward way. In contrast, our auxiliary moldable job scheduling approach incorporates information about each application’s speedup model into the processor allocation decision, aiming to further improve overall system performance.

Our auxiliary moldable job scheduling approach adopts Downey’s speedup model of parallel programs [14][15] to take into account both single-job speedup and overall system performance. Downey’s model has been shown to be capable of representing the parallelism and speedup characteristics of real parallel applications [14][15].

The speedup of a job on n processors is defined as the ratio of the job’s run time on a single processor to its run time on n processors:

$$S(n) = \frac{L}{T(n)} \tag{1}$$

Here, S is the speedup function, L is the effective sequential run time and T(n) is the run time of the job on n processors. Downey’s model is a non-linear function of the following two parameters [14]:

• σ is an approximation of the coefficient of variation in parallelism within the job. It determines how close to linear the speedup is: a value of zero indicates linear speedup, and higher values indicate greater deviation from the linear curve.

• A denotes the average parallelism of a job and is a measure of the maximum speedup that the job can achieve.

In [14], Downey proposed two variants of this model, one for programs with low variance in parallelism and one for programs with high variance.

Figure 3.1 shows a hypothetical parallelism profile for a program with low variance in its degree of parallelism. The parallelism equals A, the average parallelism, for all but some fraction σ of the duration (0 ≤ σ ≤ 1). The remaining time is divided between a sequential component and a high-parallelism component, whose parallelism is chosen such that the average parallelism is A.

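Under the low-variance model, the run time and speedup of a parallel program as functions of the number of processors n are given by equations (2) and (3), respectively:

$$T(n) = \begin{cases} \dfrac{L\,\bigl(A + \sigma (n-1)/2\bigr)}{nA}, & 1 \le n \le A \\[6pt] \dfrac{L\,\bigl(\sigma (A - 1/2) + n (1 - \sigma/2)\bigr)}{nA}, & A \le n \le 2A - 1 \\[6pt] \dfrac{L}{A}, & n \ge 2A - 1 \end{cases} \tag{2}$$

$$S(n) = \begin{cases} \dfrac{An}{A + \sigma (n-1)/2}, & 1 \le n \le A \\[6pt] \dfrac{An}{\sigma (A - 1/2) + n (1 - \sigma/2)}, & A \le n \le 2A - 1 \\[6pt] A, & n \ge 2A - 1 \end{cases} \tag{3}$$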

Figure 3.1: The parallelism profile for low-variance speedup model
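The low-variance model can be evaluated directly. The Python sketch below assumes the standard piecewise form of Downey’s model [14]: the speedup grows until n = A, keeps growing more slowly until n = 2A − 1, and stays flat at A afterwards; the run time follows from T(n) = L / S(n). The function names and example parameters are hypothetical.

```python
def speedup_low_variance(n: float, A: float, sigma: float) -> float:
    """Speedup S(n) of Downey's low-variance model (0 <= sigma <= 1)."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("the low-variance model requires 0 <= sigma <= 1")
    if n < 1 or A < 1:
        raise ValueError("n and A must be at least 1")
    if n <= A:
        return A * n / (A + sigma * (n - 1) / 2)
    if n <= 2 * A - 1:
        return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
    return float(A)  # speedup saturates at the average parallelism A


def runtime_low_variance(n: float, L: float, A: float, sigma: float) -> float:
    """Run time T(n) = L / S(n), with L the effective sequential run time."""
    return L / speedup_low_variance(n, A, sigma)


if __name__ == "__main__":
    # Hypothetical job: A = 32, sigma = 0.5, L = 1000 seconds.
    for n in (1, 8, 32, 48, 64):
        print(n, round(speedup_low_variance(n, 32, 0.5), 2),
              round(runtime_low_variance(n, 1000.0, 32, 0.5), 1))
```

With σ = 0 the function reduces to S(n) = min(n, A), the linear-then-flat curve described for a value of zero.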

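Figure 3.2 shows a hypothetical parallelism profile for a program with high variance in parallelism (σ ≥ 1). The profile consists of a sequential component of duration σ and a parallel component of duration 1 with potential parallelism A + Aσ − σ. A program with this profile has the following run time and speedup as functions of the number of processors n, given by equations (4) and (5), respectively:

$$T(n) = \begin{cases} \dfrac{L\,\bigl(\sigma (n + A - 1) + A\bigr)}{nA(\sigma + 1)}, & 1 \le n \le A + A\sigma - \sigma \\[6pt] \dfrac{L}{A}, & n \ge A + A\sigma - \sigma \end{cases} \tag{4}$$

$$S(n) = \begin{cases} \dfrac{nA(\sigma + 1)}{\sigma (n + A - 1) + A}, & 1 \le n \le A + A\sigma - \sigma \\[6pt] A, & n \ge A + A\sigma - \sigma \end{cases} \tag{5}$$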

Figure 3.2: The parallelism profile for high-variance speedup model
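The high-variance model can be checked against its parallelism profile. The Python sketch below assumes the standard form of Downey’s high-variance model [14] and cross-checks T(n) = L / S(n) against the run time read directly off the profile in Figure 3.2 (a sequential phase of length σ plus a parallel phase of length 1 with parallelism A + Aσ − σ). Function names and example values are hypothetical.

```python
def speedup_high_variance(n: float, A: float, sigma: float) -> float:
    """Speedup S(n) of Downey's high-variance model (sigma >= 1)."""
    if sigma < 1.0:
        raise ValueError("the high-variance model requires sigma >= 1")
    if n < 1 or A < 1:
        raise ValueError("n and A must be at least 1")
    max_parallelism = A + A * sigma - sigma  # parallelism of the parallel phase
    if n <= max_parallelism:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def runtime_high_variance(n: float, L: float, A: float, sigma: float) -> float:
    """Run time T(n) = L / S(n)."""
    return L / speedup_high_variance(n, A, sigma)


if __name__ == "__main__":
    # Total work of the profile: L = sigma + (A + A*sigma - sigma) = A * (1 + sigma).
    # For n <= A + A*sigma - sigma the profile's run time is
    # sigma + (A + A*sigma - sigma) / n; both columns should agree.
    A, sigma = 16.0, 2.0
    L = A * (1 + sigma)
    for n in (1, 4, 23, 46):
        from_profile = sigma + (A + A * sigma - sigma) / n
        print(n, runtime_high_variance(n, L, A, sigma), from_profile)
```

The agreement follows because L / S(n) simplifies to σ + (A + Aσ − σ)/n when L = A(1 + σ).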

Speeding up a single moldable job is easy; it can usually be achieved by giving the job more processors. However, processor allocation for moldable jobs often faces a dilemma over whether to maximize a job’s speedup: because the total number of processors in a system is usually fixed, speeding up one job may lengthen the turnaround time of another. Moreover, the speedup may come at the cost of degraded system utilization, since the efficiency of a parallel program is usually below 100% and may even decline as the number of processors increases. Therefore, determining the most appropriate number of processors for each job with respect to the overall performance of all jobs is nontrivial.

Previous research [14] proposed that the optimal allocation for a parallel job is the one that maximizes its power, defined as the product of its speedup S(n) and its efficiency E(n) = S(n)/n. In [14], finding this allocation is called calculating the knee. Building on the concept of the knee, our auxiliary moldable job scheduling approach extends the restricted scaling up and down approach in [11].
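Calculating the knee can be sketched as a brute-force search over processor counts. The Python sketch below assumes the job’s speedup function is available as a callable and that the search is bounded by a hypothetical upper limit n_max (for instance, the system size); the function name `knee` and its parameters are illustrative, not taken from [14].

```python
from typing import Callable


def knee(speedup: Callable[[int], float], n_max: int) -> int:
    """Return the n in 1..n_max that maximizes power = S(n) * E(n)."""
    best_n, best_power = 1, float("-inf")
    for n in range(1, n_max + 1):
        s = speedup(n)
        efficiency = s / n          # E(n) = S(n) / n
        power = s * efficiency      # power = speedup x efficiency
        if power > best_power:      # strict '>' keeps the smallest n on ties
            best_n, best_power = n, power
    return best_n


if __name__ == "__main__":
    # Downey's model with sigma = 0 reduces to S(n) = min(n, A):
    # power grows as n up to A and falls as A**2 / n afterwards.
    print(knee(lambda n: min(n, 16), n_max=64))  # 16
```

A linear scan is adequate because n_max is bounded by the machine size; keeping the smallest n on ties avoids spending processors that add no power.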

When a parallel job becomes the first job in the waiting queue and its originally requested number of processors is larger than the number of free processors, the system, instead of keeping it waiting in the queue, automatically scales the job down to use exactly the number of free processors. Conversely, if the number of free processors is larger than the job’s requested number, the system automatically scales the job’s actual number of processors up to the minimum of the number of free processors and the optimal value obtained by calculating the knee from the job’s speedup model. Moreover, to prevent scaling up one job from delaying the start times of the jobs that follow it in the queue, the approach scales a job up only if no other jobs are waiting behind it in the queue.
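The allocation rule for the job at the head of the queue can be sketched as follows, assuming its knee value has already been computed from its speedup model. The function name and parameters are hypothetical, and a return value of 0 means the job keeps waiting. The text does not say what happens when the knee lies below the requested number while extra processors are free; this sketch assumes the job then simply keeps its request.

```python
def auxiliary_allocation(requested: int, free: int, knee_value: int,
                         jobs_behind: int) -> int:
    """Processors for the head-of-queue job under the knee-based rule (0 = wait)."""
    if free <= 0:
        return 0                      # nothing free: the job keeps waiting
    if requested >= free:
        return free                   # scale down (or exact fit) to the free processors
    if jobs_behind > 0:
        return requested              # restricted: no scaling up while others wait
    # Scale up to min(free, knee); assumption: never go below the user's request.
    return max(requested, min(free, knee_value))


if __name__ == "__main__":
    print(auxiliary_allocation(30, 20, knee_value=25, jobs_behind=2))  # 20
    print(auxiliary_allocation(8, 64, knee_value=23, jobs_behind=0))   # 23
    print(auxiliary_allocation(8, 64, knee_value=23, jobs_behind=1))   # 8
```

Compared with the restricted scaling up and down policy, the only change is the knee cap in the last line, which stops a lone job from absorbing processors whose marginal power is declining.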

Figure 3.3 illustrates the advantage of moldable job scheduling over traditional rigid job scheduling. In the left part of the figure, rigid job scheduling allocates to each parallel job exactly the number of processors specified at submission, so task III cannot obtain enough processors to start at the beginning, resulting in a longer average turnaround time for the three jobs and lower resource utilization. In the right part of the figure, moldable job scheduling scales task III down to use only 20 processors instead of the 30 originally specified. This allows task III to start earlier with fewer processors. Although task III takes longer to execute under this arrangement, the total turnaround time is actually reduced, because the waiting time of task III drops to zero. This comparison demonstrates the potential advantage of moldable job scheduling.

Figure 3.3: Advantage of moldable job scheduling

Figure 3.4 illustrates the advantage of our knee-based auxiliary moldable job scheduling approach over the greedy approaches in existing moldable job scheduling methods [6][11]. In the left part of the figure, because the greedy approach scales a job’s parallelism up as far as possible, task I receives more processors. However, this delays the start times of tasks II and III because too few processors remain at the beginning, resulting in a longer average turnaround time for the three jobs and lower resource utilization. In the right part of the figure, our auxiliary moldable job scheduling approach limits each job’s maximum parallelism to its knee value. Task I therefore consumes fewer processors, which allows task II to start at the beginning and yields a shorter average turnaround time for the three jobs. This comparison demonstrates the potential advantage of our auxiliary moldable job scheduling approach.

Figure 3.4: Advantage of our auxiliary moldable job scheduling

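From the document: 高效能計算即服務平台上具可調式平行度之工作排程問題研究 (A Study on Scheduling Jobs with Moldable Parallelism on High-Performance-Computing-as-a-Service Platforms), pp. 16-24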