This chapter evaluates the four dynamic scheduling approaches proposed in chapter 3 through a series of simulation experiments. In the experiments, the four approaches are compared under three different waiting queue sequencing policies: EDF (Earliest Deadline First), LLF (Least Laxity First), and SJF (Shortest Job First). The experiments simulate a 128-processor homogeneous cluster and an online workload based on a public workload log [13]. The workload log contains 73496 records collected on a 128-node IBM SP2 machine at the San Diego Supercomputer Center (SDSC) from May 1998 to April 2000. After excluding some problematic records based on the completed field [13] in the log, the simulation experiments in this thesis use 56490 job records as the input workload. The speedup of a job with different numbers of processors is calculated using Amdahl’s Law [30], in which the parameter α indicates the fraction of computation within a job that is parallelizable. We set α = 0.6 in the following experiments. The deadline of each job is given by the following formula.
D(i) = Tsub(i) + k * Texec(1, i)
where i is the index of the job; Tsub(i) is the submission time of job i; Texec(n, i) is the execution time of job i with n processors; and k is a random number drawn from a specified range, 0.1 to kmax.
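The speedup model and the deadline formula above can be sketched in a few lines. This is a minimal illustration, not the simulator used in the thesis: it assumes the standard form of Amdahl's Law and assumes that k is drawn uniformly from [0.1, kmax], since the text does not state the distribution. The function names are hypothetical.

```python
import random

ALPHA = 0.6  # parallelizable fraction used in the experiments


def speedup(n, alpha=ALPHA):
    """Amdahl's Law speedup with n processors (standard form, assumed)."""
    return 1.0 / ((1.0 - alpha) + alpha / n)


def exec_time(n, t1, alpha=ALPHA):
    """Texec(n, i), given the one-processor time t1 = Texec(1, i)."""
    return t1 / speedup(n, alpha)


def deadline(t_sub, t1, k_max, rng=random):
    """D(i) = Tsub(i) + k * Texec(1, i).

    Assumption: k is sampled uniformly from [0.1, k_max].
    """
    k = rng.uniform(0.1, k_max)
    return t_sub + k * t1
```

Under these assumptions the speedup is bounded by 1 / (1 - α) = 2.5, so a job whose k falls below 1 - α + α/128 (about 0.405) could not meet its deadline even if it started at submission on all 128 processors.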
In the following, the four approaches are evaluated in terms of three different performance metrics: deadline miss rate, average turnaround time, and total profit. Average turnaround time is the most commonly used performance metric for comparing different batch job scheduling approaches, where turnaround time is calculated by subtracting the job submission time from the job finish time. Deadline miss rate is a typical performance metric for scheduling jobs with deadlines and is defined as the number of jobs unable to meet their deadlines divided by the total number of jobs submitted. In this thesis, every deadline is a hard deadline, and our scheduling approaches determine whether a job can meet its deadline before it actually starts. Total profit is used as a performance metric in this thesis since it is one of the main concerns of HPCaaS providers. In the experiments, we use the total CPU time of all finished jobs to represent the total profit, since in actual environments the charge for running an HPC job is usually proportional to the CPU time used.
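The three metrics can be computed from per-job simulation records as sketched below. The `JobResult` record and its field names are hypothetical. The sketch also makes three assumptions that the text does not spell out: a discarded job counts as a deadline miss, average turnaround time is taken over finished jobs only, and the CPU time of a job is its processor count multiplied by its runtime.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobResult:
    submit: float             # submission time
    deadline: float           # hard deadline
    finish: Optional[float]   # finish time, or None if the job was discarded
    processors: int = 0       # processors allocated (0 if discarded)
    runtime: float = 0.0      # execution time on those processors


def deadline_miss_rate(jobs: List[JobResult]) -> float:
    """Jobs unable to meet their deadlines divided by all submitted jobs."""
    missed = sum(1 for j in jobs if j.finish is None or j.finish > j.deadline)
    return missed / len(jobs)


def average_turnaround(jobs: List[JobResult]) -> float:
    """Mean of (finish - submit) over finished jobs (assumed scope)."""
    done = [j for j in jobs if j.finish is not None]
    return sum(j.finish - j.submit for j in done) / len(done)


def total_profit(jobs: List[JobResult]) -> float:
    """Total CPU time of finished jobs, used as a proxy for profit."""
    return sum(j.processors * j.runtime for j in jobs if j.finish is not None)
```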
Figures 4.1 to 4.4 evaluate each approach with different waiting queue sequencing policies. For DI Algorithm, shown in Figure 4.1, SJF is the best waiting queue sequencing policy in terms of deadline miss rate and average turnaround time. However, as far as total profit is concerned, SJF leads to the least income for HPCaaS providers, since it tends to serve smaller jobs, which consume less CPU time. The results indicate that there might be a conflict of interest between users and HPCaaS providers.
Figure 4.1 Comparisons of EDF, LLF, and SJF in DI Algorithm
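The three waiting queue sequencing policies compared in Figures 4.1 to 4.4 differ only in the key used to sort the queue, which the following sketch makes explicit. The dictionary fields and function names are hypothetical. Because jobs can run on different numbers of processors, the sketch assumes that both laxity and job length are computed from the one-processor execution time `t1`; the thesis may define them differently.

```python
def edf_key(job, now):
    """Earliest Deadline First: smaller deadline goes first."""
    return job["deadline"]


def llf_key(job, now):
    """Least Laxity First: laxity = time to deadline minus execution time.

    Assumption: execution time is the one-processor time t1.
    """
    return job["deadline"] - now - job["t1"]


def sjf_key(job, now):
    """Shortest Job First: shorter job goes first (one-processor time assumed)."""
    return job["t1"]


def sequence(queue, policy, now):
    """Return the waiting queue ordered by the given policy key."""
    return sorted(queue, key=lambda job: policy(job, now))
```

For example, with jobs a (deadline 100, t1 50), b (deadline 60, t1 5), and c (deadline 80, t1 70) at time 0, EDF orders them b, c, a; LLF orders them c, a, b; and SJF orders them b, a, c.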
As for FM Algorithm (Figure 4.2), SJF and EDF deliver very similar results in terms of deadline miss rate and average turnaround time. SJF still has an advantage in average turnaround time, but EDF leads to a slightly lower deadline miss rate. As far as total profit is concerned, LLF leads to the highest income.
Figure 4.2 Comparisons of EDF, LLF, and SJF in FM Algorithm
The performance results of RB Algorithm and Prescheduling Algorithm, shown in Figures 4.3 and 4.4, respectively, show a very similar trend. The relative performance of EDF and SJF is the same as for FM Algorithm in Figure 4.2. However, EDF leads to the largest income for both algorithms.
Figure 4.3 Comparisons of EDF, LLF, and SJF in RB Algorithm
Figure 4.4 Comparisons of EDF, LLF, and SJF in SNP Algorithm
Since Prescheduling Algorithm conducts full-ahead planning of all the jobs in the waiting queue, another issue to consider, in addition to the waiting queue sequencing policy, is how processors are allocated to each job in the planning process. We evaluated three possible allocation mechanisms in the following experiments. The first one, called Smallest Number of Processors (SNP), is the allocation mechanism described in the Prescheduling Algorithm in chapter 3, which tries to allocate the smallest number of processors sufficient for a job to meet its deadline. It therefore starts with one processor and scans each future job-finish event to check whether the job can meet its deadline by running with one processor beginning at that time instant. If running with one processor cannot meet the deadline at any available time instant, the allocated number of processors is increased by one and the scan repeats. The entire process is repeated until an appropriate number of processors and a corresponding time instant are found, or until it is concluded that the job has no chance of meeting its deadline.
The second allocation mechanism, called Threshold-based Largest Number of Processors (TLNP), works in the same way as SNP when the number of currently available processors is larger than a pre-defined threshold value. When the number of currently available processors is equal to or below the threshold value, TLNP instead tries to allocate as many processors as possible to the job, so that it finishes earlier and fragmented resources are not wasted. The third mechanism, called Earliest Start Time (EST) in the following figures, tries to allow each job to start as soon as possible. SNP scans all future job-finish events for a specific number of processors before increasing that number by one. In contrast, EST checks all possible numbers of processors, from one to the number of free processors, at the time instant of a particular future job-finish event before proceeding to the next job-finish event.
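The difference between the three allocation mechanisms is mainly a matter of loop order, which the sketch below tries to make concrete. It is a simplified illustration under stated assumptions, not the Prescheduling Algorithm of chapter 3. The planned schedule is modeled as a hypothetical step profile, a time-sorted list of (time, free processors) pairs whose first entry is the current time. Execution time follows the standard form of Amdahl's Law with α = 0.6. The TLNP branch is one possible reading of "as many processors as possible" (start now with the largest allocation that fits), and its fallback to SNP is an assumption.

```python
def exec_time(n, t1, alpha=0.6):
    """Execution time on n processors (standard Amdahl form, assumed)."""
    return t1 * ((1.0 - alpha) + alpha / n)


def fits(profile, start, end, n):
    """True if at least n processors are free throughout [start, end)."""
    for idx, (t, free) in enumerate(profile):
        nxt = profile[idx + 1][0] if idx + 1 < len(profile) else float("inf")
        if nxt > start and t < end and free < n:
            return False
    return True


def snp_allocate(profile, t1, deadline, total=128):
    """SNP: for each processor count, scan all candidate start times."""
    for n in range(1, total + 1):
        run = exec_time(n, t1)
        for t, _ in profile:
            if t + run > deadline:
                break  # later events start even later, so stop scanning
            if fits(profile, t, t + run, n):
                return n, t
    return None  # the job cannot meet its deadline


def est_allocate(profile, t1, deadline):
    """EST: for each candidate start time, try 1..free processors."""
    for t, free in profile:
        for n in range(1, free + 1):
            run = exec_time(n, t1)
            if t + run <= deadline and fits(profile, t, t + run, n):
                return n, t
    return None


def tlnp_allocate(profile, t1, deadline, threshold=13, total=128):
    """TLNP: SNP above the threshold, largest feasible allocation otherwise."""
    now, free_now = profile[0]
    if free_now > threshold:
        return snp_allocate(profile, t1, deadline, total)
    for n in range(free_now, 0, -1):
        run = exec_time(n, t1)
        if now + run <= deadline and fits(profile, now, now + run, n):
            return n, now
    return snp_allocate(profile, t1, deadline, total)  # assumed fallback
```

As an example, take the profile `[(0, 2), (60, 0), (70, 128)]`, a job with a one-processor time of 80, and a deadline of 200. SNP returns one processor starting at time 70, whereas EST and TLNP return two processors starting at time 0, because the job then finishes before the fully reserved interval begins at time 60.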
Figure 4.5 compares the performance of the three allocation mechanisms in Prescheduling Algorithm across the three waiting queue sequencing policies. In the experiments, the threshold value in TLNP was set to 13. In terms of deadline miss rate, TLNP outperforms the other two under all three waiting queue sequencing policies.
Comparing Figures 4.3 and 4.5, Prescheduling Algorithm outperforms RB Algorithm in terms of average turnaround time and TLNP in general delivers the shortest average turnaround time among the three allocation mechanisms. Regarding total profit, Prescheduling Algorithm has the potential to earn the most income when adopting EDF and EST together.
Figure 4.5 Comparisons of three different allocation mechanisms
Figure 4.6 evaluates the effects of different values of kmax, which determines the range of deadlines. Regarding deadline miss rate, the influence of kmax is more significant for Prescheduling Algorithm. DI Algorithm has the potential to outperform the others for smaller deadline ranges, while Prescheduling Algorithm is more likely to achieve the lowest deadline miss rate for larger deadline ranges. In terms of average turnaround time, the performance of Prescheduling Algorithm degrades more quickly as kmax increases. For total profit, kmax has no obvious impact on the relative performance of the four scheduling approaches.
Figure 4.6 Evaluation of the impact of kmax
A real HPC environment usually does not serve only jobs with deadlines. Instead, it usually has to serve a mix of two kinds of jobs. The first kind is usually called best-effort jobs. Most jobs in traditional supercomputing centers belong to this category: they have no deadline, and the goal is to finish their computation as soon as possible. Therefore, average turnaround time is a typical performance metric for this kind of job. The second kind of job has a deadline, and thus deadline miss rate becomes an important performance metric for comparing different scheduling approaches.
Figure 4.7 evaluates two different waiting queue sequencing policies for an HPCaaS system that handles a mix of best-effort jobs and deadline-constrained jobs. The first policy sorts all the jobs in the waiting queue in First-Come-First-Serve (FCFS) order. The second policy adopts EDF to sort the jobs, with the deadline of each best-effort job set to infinity.
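The two policies for a mixed queue can be sketched as follows. The job dictionaries and the use of `None` to mark a best-effort job are hypothetical conventions for this illustration. Because Python's `sorted` is stable, best-effort jobs, which all share an infinite deadline, keep their relative order from the input queue, which is assumed here to be in arrival order.

```python
import math


def fcfs_order(queue):
    """First-Come-First-Serve: sort by submission time."""
    return sorted(queue, key=lambda job: job["submit"])


def edf_order(queue):
    """EDF with best-effort jobs (deadline None) treated as deadline infinity."""
    def key(job):
        return math.inf if job["deadline"] is None else job["deadline"]
    return sorted(queue, key=key)
```

With this ordering, every deadline-constrained job is placed ahead of every best-effort job, which is consistent with EDF favoring deadline-constrained jobs.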
In the experiments of Figure 4.7, the HPCaaS system has to serve a mix of 80% best-effort jobs and 20% deadline-constrained jobs. Prescheduling Algorithm was used in the experiments. As expected, EDF favors deadline-constrained jobs, leading to a lower deadline miss rate. On the other hand, FCFS can deliver a shorter average turnaround time. Regarding the concerns of HPCaaS providers, EDF has the potential to earn more income, as shown in Figure 4.7 (c).
Figure 4.7 Comparison of FCFS and EDF for a mix of two kinds of jobs: (a) deadline miss rate of deadline-constrained jobs, (b) average turnaround time of best-effort jobs, (c) total profit for HPCaaS providers
Figure 4.8 evaluates the two waiting queue sequencing policies for scheduling a mix of best-effort and deadline-constrained jobs across different mix ratios. The relative performance of the two policies in terms of all three performance metrics is the same as in Figure 4.7.
Figure 4.8 Evaluation of different mix ratios (kmax = 2)
Figure 4.9 compares the four proposed approaches in terms of a new performance metric, called lag of discard, which measures the time between a job’s submission and the moment it is discarded because it cannot meet its deadline. This metric is a good measure of an HPCaaS system’s QoS: with a smaller lag of discard, users can transfer their jobs to other HPCaaS providers earlier and still have a chance of meeting their deadlines. The experimental results indicate that DI, RB, and Prescheduling Algorithms perform similarly, while FM Algorithm performs poorly on this metric.
Figure 4.9 Evaluation of lag of discard
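The lag of discard defined above can be computed as in the following short sketch. The function name and the input format are hypothetical, and aggregating by the arithmetic mean over discarded jobs is an assumption, since the text does not state how the per-job values are combined.

```python
def mean_lag_of_discard(discarded):
    """Mean of (discard time - submission time) over discarded jobs.

    discarded: list of (submit_time, discard_time) pairs.
    Assumption: the per-job lags are aggregated by their arithmetic mean.
    """
    if not discarded:
        return 0.0
    return sum(d - s for s, d in discarded) / len(discarded)
```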