Moldable Job Scheduling for HPC as a Service


Master's Thesis, Department of Computer Science, National Taichung University of Education. Advisor: Dr. 黃國展. Moldable Job Scheduling for HPC as a Service. Graduate student: 黃則齊. July 2013.

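Acknowledgments

After two years of hard work, I have finally completed this thesis. I feel very fortunate to have been able to build my professional skills and knowledge in the Department of Computer Science at National Taichung University of Education, with its excellent faculty and environment. Having studied at this school for six years, I hold many precious memories of it and leave with some reluctance. From my confused first days here, through two years of training and effort, to the growth I see today, this experience has been unique to me.

First, I would like to thank my advisor, Professor 黃國展, who over these two years selflessly led us forward in academic work and also gave us reminders and a great deal of help on matters ranging from looking after our health to dealing with people. In research, he always taught carefully with a kind and friendly smile and patiently discussed the details and direction of my work with me. His good temper also eased some of the pressure I felt when I hit research bottlenecks, which allowed me to complete this thesis smoothly.

Next, I would like to thank my family and my girlfriend. Your support behind the scenes freed me from financial worries, and you stood by me when I felt lost and helpless. That was another force pushing me forward and is the reason I could finish my studies on schedule without worries.

Finally, I would like to thank my seniors 和展, 博仁, and 迪萱 for their guidance and assistance in my studies, as well as my classmates 英麟 and 柏鈞 and my juniors 謝瑋, 俊豪, 沐容, and 曉青. Working hard together with you on the path of study was a rare opportunity that I cherish. To my partners in the systems laboratory, 小桂, 麒璋, 智忠, 效維, 瑤倫, 柏誠, 昱汶, 建佑, and 孟儒: whether on the basketball court or in the lab, we got along very well, and these will be unforgettable memories. Thank you all.

I dedicate this thesis to everyone who has ever helped me.

黃則齊, 2013.07.12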

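摘要 (Abstract)

In the past, a user who wanted to submit a parallel job to a supercomputing center had to specify a particular number of processors, and the job scheduling system then arranged each job's processor usage accordingly. However, when the specified amount of resources does not match the number of currently available resources, this kind of allocation often leads to low system utilization and prolonged job completion times. Because most parallel applications today have moldable parallelism, the number of processors actually used can be decided just before execution begins. This property can be exploited to develop new moldable job scheduling methods that further improve overall system performance and resource usage efficiency. Recently, the HPC as a Service model was proposed, and one of its goals is to let users access HPC tools and applications more conveniently. We believe that relieving users of the need to specify the number of processors is an important step toward this goal, because most HPC as a Service users are not familiar with the architecture and characteristics of the underlying applications and therefore find it hard to specify the number of processors that best improves application performance. To achieve this goal, this thesis proposes three new moldable job scheduling methods, which not only relieve users of the burden of specifying the number of processors but also further improve overall system performance. Our experimental results show that, compared with existing methods, the three methods improve system performance by up to 83%, 78%, and 89%, respectively.

Keywords: moldable parallelism, adaptive processor allocation, application parallelism model, HPC as a Service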

Abstract

Traditionally, users who submit parallel jobs to supercomputing centers need to specify the number of processors that each job requires. Job schedulers then allocate resources to each job according to the processor requirement. However, this kind of allocation has been shown to degrade system utilization and job turnaround time when the requirement and the available resources do not match. System performance can be improved through the moldable property, which most current parallel application programs have. With the moldable property, parallel programs can exploit different degrees of parallelism for execution at runtime. Previous research has shown the potential performance improvement achieved by adaptive processor allocation based on the moldable property. Recently, the concept of HPC as a Service (HPCaaS) was proposed to bring the traditional high performance computing field into the era of cloud computing. One of its goals is to give users easier access to HPC facilities and applications. This thesis deals with the related job submission and scheduling issues needed to achieve this goal. Traditional HPC users in supercomputing centers are required to specify the number of processors to use upon job submission. However, we think this requirement might not be necessary for HPCaaS users, since most modern parallel jobs are moldable and these users usually do not know how to choose a number of processors that lets their jobs finish earlier. Therefore, we propose three moldable job scheduling approaches which not only relieve HPC users of the burden of selecting an appropriate number of processors but also achieve better system performance than existing methods. The experimental results indicate that the three approaches can achieve up to 83%, 78%, and 89% performance improvement, respectively, in terms of average turnaround time.

Keywords: moldable property, adaptive processor allocation, application speedup model, HPC as a Service

Table of Contents

Acknowledgments
摘要 (Abstract)
Abstract
Table of Contents
List of Figures
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Moldable Job Scheduling Without Runtime Information
    3.1 Auxiliary Moldable Job Scheduling
    3.2 Moldable Job Scheduling for HPCaaS
Chapter 4. Moldable Job Scheduling with Runtime Information
    4.1 Previous Moldable Job Scheduling with Runtime Information
    4.2 Our Moldable Job Scheduling with Runtime Information for HPCaaS
Chapter 5. Experiments and Performance Evaluation
Chapter 6. Conclusions
References

List of Figures

Figure 1.1: Moldable job scheduling quadrants
Figure 3.1: The parallelism profile for low-variance speedup model
Figure 3.2: The parallelism profile for high-variance speedup model
Figure 3.3: Advantage of moldable job scheduling
Figure 3.4: Advantage of our auxiliary moldable job scheduling
Figure 3.5: Linear speedup model
Figure 3.6: Amdahl's law speedup model
Figure 3.7: Downey's high-variance model
Figure 3.8: Downey's low-variance model
Figure 3.9: Comparison of auxiliary moldable job scheduling and moldable job scheduling for HPCaaS
Figure 4.1: The original moldable job scheduling for HPCaaS approach when applying the parallel allocation policy
Figure 4.2: Advantage of the moldable job scheduling with runtime information for HPCaaS
Figure 4.3: Algorithm of the improved moldable job scheduling for HPCaaS
Figure 5.1: Evaluation of the auxiliary moldable job scheduling approach with Downey's low-variance speedup model
Figure 5.2: Evaluation of the auxiliary moldable job scheduling approach with Downey's high-variance speedup model
Figure 5.3: Evaluation of moldable job scheduling for HPCaaS with Downey's low-variance speedup model
Figure 5.4: Evaluation of moldable job scheduling for HPCaaS with Downey's high-variance speedup model
Figure 5.5: Evaluation of moldable job scheduling with runtime information using Downey's low-variance speedup model
Figure 5.6: Evaluation of moldable job scheduling with runtime information using Downey's low-variance speedup model
Figure 5.7: Evaluation of moldable job scheduling with runtime information using Amdahl's law speedup model

Chapter 1. Introduction

Parallel job scheduling and allocation has long been an important research topic [1][2][3]. Users at traditional supercomputing centers usually need to specify the number of processors to use when submitting a parallel job. The workload management system and job scheduler then allocate computing resources to each job according to the specified processor requirement. If the specified number of processors is larger than the currently available resources, the job has to wait while the available resources sit idle, which degrades both resource utilization and job turnaround time. This situation may be unavoidable when running rigid jobs [4], which can only run with a specific number of processors, or when conducting performance benchmarking, e.g. drawing a speedup curve. However, most modern parallel applications, e.g. those written with MPI [5], have the moldable property [4], which allows them to exploit different degrees of parallelism for execution at runtime. In such cases, system performance can be improved through adaptive processor allocation that takes advantage of the moldable property [4]. Here, adaptive processor allocation means that users and job schedulers have the flexibility to select a suitable number of processors for job execution, considering both the jobs' parallelism characteristics and the resources available at that moment.

Some existing High-Performance Computing (HPC) workload management systems already support moldable job submission. For example, Load Sharing Facility (LSF) [6], one of the best-known commercial HPC workload management systems, owned by IBM, allows users to specify a range of processor requirements, instead of a specific number of processors, when submitting a moldable job. However, its scheduling mechanism for moldable jobs is quite primitive: it simply adopts a greedy method that allocates as many processors as possible within the range specified by a moldable job upon submission.

High performance computing (HPC) has long been a very important field for solving large-scale and complex scientific and engineering problems. However, accessing and running applications on HPC systems remains tedious, which limits wider adoption and the size of the user population [7]. As cloud computing emerges [8], with its emphasis on easier and more efficient access to IT infrastructure, the concept of HPC as a Service (HPCaaS) [7] was recently proposed to transform HPC facilities and applications into a more convenient and accessible service model. Unlike users at traditional supercomputing centers, who usually run parallel programs they developed themselves, most HPCaaS users run parallel applications developed by others and do not have a clear picture of the parallelism characteristics of the underlying applications. Such users simply want their jobs done as soon as possible and either do not want to, or do not know how to, specify an appropriate number of processors for the application's execution.

Therefore, moldable job scheduling becomes a crucial research issue for HPCaaS in two respects. First, it relieves users of the burden of selecting an appropriate number of processors upon job submission, leading to an easier and more convenient user experience for HPCaaS. Second, moldable job scheduling has the potential to improve the average turnaround time of parallel applications and the overall resource utilization, benefiting both HPCaaS users and providers. Moldable job scheduling is also more feasible in HPCaaS, since the parallelism characteristics and speedup models of the provided parallel applications are more likely to be available than those of home-made parallel programs on traditional supercomputers.

Logically, moldable job scheduling approaches can be classified into four categories, as shown in Figure 1.1, according to two aspects: whether the decision is made at submit time or at schedule time, and whether job runtime information is available. Here, job runtime information is the expected processor execution time required to process a job's workload. Several moldable job scheduling approaches have been proposed in previous research [9][10][11]. They fall into the categories in Figure 1.1 other than the one that makes moldable decisions at submission time without runtime information. In general, schedule-time approaches have the potential to outperform submit-time approaches, since job schedulers have more accurate runtime information and greater flexibility in choosing an appropriate number of processors for a specific job at schedule time. Therefore, in this thesis, we focus on the two schedule-time categories and propose three moldable job scheduling approaches which take advantage of application speedup models to make appropriate processor allocation decisions at schedule time, falling into categories II and III in Figure 1.1.

Figure 1.1: Moldable job scheduling quadrants

Information about parallel program behavior is crucial for job schedulers to automatically choose effective numbers of processors for applications. In this thesis, we consider three commonly used parallel speedup models: linear speedup [12], Amdahl's law [13], and Downey's speedup model [14][15]. These models have been shown capable of representing the workload characteristics of both real parallel applications and entire system workloads, such as the NAS benchmarks [16], the SDSC workload [17], and the CTC workload [17]. The proposed moldable job scheduling approaches have been evaluated through a series of simulation experiments with different application speedup models and workload conditions. The experimental results indicate that our approaches have the potential to outperform existing moldable job scheduling approaches, achieving up to 83%, 78%, and 89% performance improvement, respectively.

The remainder of this thesis is organized as follows. Chapter 2 discusses related work on moldable job scheduling. Chapter 3 describes the moldable job scheduling approaches without runtime information. Chapter 4 presents the moldable job scheduling approaches that take advantage of job runtime information. Chapter 5 presents the experiments and the results of the performance evaluation. Chapter 6 concludes this thesis.

Chapter 2. Related Work

Parallel jobs can be divided into four classes according to their flexibility in parallelism [18]: (1) rigid, (2) moldable, (3) evolving, and (4) malleable. Rigid jobs [4] can only run with a specific number of processors. Backfilling job scheduling approaches [19][20] are usually adopted to deal with the resource fragmentation problem and improve overall system performance when scheduling rigid jobs in parallel systems. Moldable jobs [4] are flexible in the number of processors at the time the job starts, but cannot be reconfigured during execution. This is the type of parallel job we deal with in this thesis. Malleable jobs [4] are similar to moldable jobs in that both can run with different degrees of parallelism, in contrast to rigid jobs. However, malleable jobs are even more flexible in that they can change the number of processors used dynamically during execution, while moldable jobs must determine the number of processors before execution and keep it fixed throughout the entire execution period. Sun et al. proposed an adaptive scheduling approach for malleable jobs with periodic processor reallocations based on parallelism feedback from the jobs and the allocation policy of the system in [21]. Both evolving and malleable jobs can change their processor requirements during execution. For evolving jobs [22][23] the changes are application initiated, while for malleable jobs they are system initiated.

In the following, we discuss related work on moldable job scheduling in parallel computing systems according to the classification in Figure 1.1. For category I, Cirne and Berman proposed an application-level scheduling approach for moldable jobs in [24][25], where users provide a set of candidate requests with different processor requirements, and the application scheduler adaptively selects the most suitable request based on the current system configuration and workload status. Since job schedulers have more accurate runtime information and greater flexibility in choosing an appropriate number of processors for a specific job at schedule time, in this thesis we focus on the two schedule-time categories.

For category II, Sabin et al. proposed in [26] an iterative algorithm which utilizes job efficiency information for scheduling moldable jobs. Because it is an iterative approach, the algorithm has higher computational complexity. In [9][10], Srinivasan et al. proposed a schedule-time aggressive fair-share strategy and a combined moldable scheduling strategy for moldable jobs, which adopt a profile-based allocation scheme. The strategies keep track of all the free-time blocks available in the current schedule and, at each scheduling activity, scan all the blocks to find the most suitable one for a moldable job, considering the effect of partition size on the performance of the application. The strategies thus need knowledge of job execution time.

For category III, Huang proposed and evaluated four adaptive processor allocation heuristics for moldable jobs in [11]. The heuristics determine the number of processors to use for each moldable job when it becomes the first job in the waiting queue. Only the currently available resources and job queue information are considered when making processor allocation decisions. These approaches do not require information about job execution time, and the processor allocation decision is made at job start time instead of submission time.

Most of the previous approaches require users to provide a candidate number, or a range of numbers, of processors upon job submission. The moldable job scheduling approaches then try to find the most appropriate number of processors for job execution based on application characteristics and workload conditions. In this thesis, we aim to develop moldable job scheduling approaches which not only relieve users of the burden of specifying processor requirements upon job submission, but also improve overall system performance by reducing the average turnaround time of all jobs. The proposed approaches can contribute to the realization of the HPC as a Service (HPCaaS) model [7].

Chapter 3. Moldable Job Scheduling Without Runtime Information

This chapter explores schedule-time moldable job scheduling without runtime information, which falls into category III in Figure 1.1. The first part is a feasible extension to traditional HPC usage scenarios. There, we propose an auxiliary moldable job scheduling approach which dynamically adjusts the number of processors to use based on the application's speedup model. In this case, users still specify a preferred number of processors upon job submission. However, since the job is moldable, the scheduler has the flexibility to adjust the number of processors actually used when making the allocation decision, in order to reduce the average turnaround time of jobs and improve the overall resource utilization rate.

In the second part, we propose a moldable job scheduling approach for HPCaaS scenarios, where users do not have to specify a number of processors upon job submission. The scheduler automatically selects the most appropriate number of processors for a job according to its speedup model and the workload condition at that moment. The automatic processor allocation decision is expected to benefit both HPCaaS users and providers through reduced average turnaround time of jobs and an increased resource utilization rate.

The proposed moldable job scheduling approaches are evaluated and compared with existing methods, including the greedy approach in LSF [6], the combined moldable scheduling strategy in [9][10], and the adaptive processor allocation heuristics in [11], through a series of simulation experiments in Chapter 5.

3.1 Auxiliary Moldable Job Scheduling

Most current HPC workload management systems require users to specify a specific number of processors for their jobs upon job submission. For moldable jobs, a feasible extension to this usage model is to give the job scheduler the flexibility to change the number of processors a job actually uses, dynamically and adaptively, right before its execution starts. The moldable job scheduling approach is auxiliary in the sense that users still need to provide a preferred number of processors when submitting their jobs. Some existing systems, e.g. LSF [6], and previous research, e.g. the adaptive processor allocation heuristics in [11], provide moldable job scheduling in this way. In this section, we propose a new auxiliary moldable job scheduling approach, which uses information about applications' speedup models to make more effective processor allocation decisions and is thus expected to outperform the previous primitive methods [6][11].

In the greedy approach used in LSF [6], the HPC system allows users to specify a range of processor requirements, instead of a specific number of processors, when submitting a moldable job. However, its scheduling mechanism for moldable jobs is quite primitive: it simply adopts a greedy method that allocates as many processors as possible within the range specified by the moldable job upon submission. In [11], the authors propose and evaluate four variations of processor allocation heuristics, described in detail below:

- No adaptive scaling. This policy allocates to each parallel job exactly the number of processors it specified.
- Adaptive scaling down. If a parallel job specifies a number of processors which, at schedule time, is larger than the number of free processors, then instead of keeping the job waiting in the queue, the system automatically scales the job down to use exactly the number of free processors.
- Adaptive scaling up and down. In addition to the scaling down mechanism of the previous policy, this policy automatically scales a parallel job up to use all free processors even if its original requirement is not that large.
- Restricted scaling up and down. This is a restricted version of the previous policy. To prevent the scaling up of one parallel job from delaying the start time of the following jobs, the system scales a parallel job up only if there are no jobs behind it in the queue.

The experimental results in [11] indicate that, in general, the restricted scaling up and down approach achieves the best performance. Both previous methods, in [6] and [11], work in a very simple and straightforward way. In contrast, our auxiliary moldable job scheduling approach incorporates information about applications' speedup models into the processor allocation decision process, aiming to further improve overall system performance.

Our auxiliary moldable job scheduling approach adopts Downey's speedup model of parallel programs [14][15] in order to take both single-job speedup and overall system performance into consideration. The speedup model developed by Downey has been shown capable of representing the parallelism and speedup characteristics of real parallel applications [14][15]. The speedup of a job on n processors is defined as the ratio of the job's run time on a single processor to the job's run time on n processors:

$$S(n) = \frac{L}{T(n)} \qquad (1)$$

Here, S is the speedup function, L is the effective sequential run time, and T(n) is the run time of the job on n processors. Downey's model is a non-linear function of the following two parameters [14]:

- σ is an approximation of the coefficient of variance in parallelism within the job. It determines how close to linear the speedup is. A value of zero indicates linear speedup, and higher values indicate greater deviation from the linear curve.
- A denotes the average parallelism of a job and is a measure of the maximum speedup that the job can achieve.

Downey proposed two speedup models, with low and high variance respectively, in [14]. Figure 3.1 is a hypothetical parallelism profile for a program with low variance in degree of parallelism. The parallelism is equal to A, the average parallelism, for all but some fraction σ of the duration (0 ≤ σ ≤ 1). The remaining time is divided between a sequential component and a high-parallelism component (with parallelism chosen such that the average parallelism is A). The run time and speedup of a parallel program under the low-variance model, as functions of the number of processors, are described in Equations (2) and (3), respectively.

Figure 3.1: The parallelism profile for low-variance speedup model
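For reference, the low-variance model is sketched below in the form commonly quoted from Downey [14], with run time normalized so that T(1) = A. The tags (2) and (3) are an assumption about how these formulas map onto the equation numbers used in the text, and the notation may differ from the thesis's own typesetting.

```latex
% Low-variance model (0 <= sigma <= 1): run time on n processors
\begin{equation}
T(n) =
\begin{cases}
\dfrac{A - \sigma/2}{n} + \dfrac{\sigma}{2}, & 1 \le n \le A \\[2ex]
\dfrac{\sigma\,(A - 1/2)}{n} + 1 - \dfrac{\sigma}{2}, & A \le n \le 2A - 1 \\[2ex]
1, & n \ge 2A - 1
\end{cases}
\tag{2}
\end{equation}

% Corresponding speedup S(n) = T(1) / T(n) with T(1) = A
\begin{equation}
S(n) =
\begin{cases}
\dfrac{A\,n}{A + \sigma\,(n - 1)/2}, & 1 \le n \le A \\[2ex]
\dfrac{A\,n}{\sigma\,(A - 1/2) + n\,(1 - \sigma/2)}, & A \le n \le 2A - 1 \\[2ex]
A, & n \ge 2A - 1
\end{cases}
\tag{3}
\end{equation}
```

As a quick consistency check, setting σ = 0 gives S(n) = n for n ≤ A and S(n) = A beyond that, which matches the statement that a value of zero indicates linear speedup bounded by the average parallelism.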

Figure 3.2 shows a hypothetical parallelism profile for a program with high variance in parallelism. The profile consists of a sequential component of duration σ and a parallel component of duration 1 with potential parallelism A + Aσ − σ. A program with this profile has the run time and speedup, as functions of the number of processors, described in Equations (4) and (5), respectively.

Figure 3.2: The parallelism profile for high-variance speedup model
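For reference, the high-variance model follows directly from the profile just described: the sequential part takes time σ, and the parallel part holds A + Aσ − σ units of work. The sketch below uses the form commonly quoted from Downey [14]; the tags (4) and (5) are an assumption about how these formulas map onto the equation numbers used in the text.

```latex
% High-variance model (sigma >= 1): run time on n processors
\begin{equation}
T(n) =
\begin{cases}
\sigma + \dfrac{A + A\sigma - \sigma}{n}, & 1 \le n \le A + A\sigma - \sigma \\[2ex]
\sigma + 1, & n \ge A + A\sigma - \sigma
\end{cases}
\tag{4}
\end{equation}

% Corresponding speedup S(n) = T(1) / T(n) with T(1) = A (sigma + 1)
\begin{equation}
S(n) =
\begin{cases}
\dfrac{n\,A\,(\sigma + 1)}{\sigma\,(n + A - 1) + A}, & 1 \le n \le A + A\sigma - \sigma \\[2ex]
A, & n \ge A + A\sigma - \sigma
\end{cases}
\tag{5}
\end{equation}
```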

Speeding up a single moldable job is easy and can usually be achieved by giving the job more processors. However, processor allocation for moldable jobs often faces a dilemma over whether to make a job's speedup as large as possible, since speeding up one job may lengthen the turnaround time of another because the total number of processors in a system is usually fixed. Moreover, the speedup may come at the cost of degraded system utilization, since the efficiency of a parallel program is usually below 100% and may decline further as the number of processors used increases. Therefore, determining the most appropriate number of processors for each job with respect to the overall system performance of all jobs is not trivial. Previous research in [14] proposed the idea that an optimal allocation for a parallel job is the one that maximizes the power, defined as the product of the speedup and the efficiency. This concept was called calculating the knee in [14]. Based on the concept of the knee, our auxiliary moldable job scheduling approach extends the restricted scaling up and down approach in [11], as described in the following.
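The knee calculation can be illustrated with a minimal Python sketch. It assumes Downey's speedup formulas in their commonly quoted form and finds the knee by brute-force search over integer processor counts. The function names `downey_speedup` and `knee` and the search bound `max_n` are hypothetical choices for this illustration, not part of the thesis.

```python
def downey_speedup(n, A, sigma):
    """Speedup on n processors for average parallelism A and variance sigma.

    Assumption: the commonly quoted form of Downey's model, using the
    low-variance branch for sigma <= 1 and the high-variance branch otherwise.
    """
    if sigma <= 1:
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return float(A)
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def knee(A, sigma, max_n):
    """Return the processor count in 1..max_n that maximizes power.

    Power is speedup times efficiency, i.e. S(n) * (S(n) / n).
    Ties are resolved in favor of the smaller processor count.
    """
    best_n, best_power = 1, 0.0
    for n in range(1, max_n + 1):
        s = downey_speedup(n, A, sigma)
        power = s * s / n
        if power > best_power:
            best_n, best_power = n, power
    return best_n


if __name__ == "__main__":
    # With sigma = 0 the speedup is linear up to A, so the knee sits at A.
    print(knee(A=16, sigma=0.0, max_n=64))
```

The exhaustive search keeps the sketch simple; a closed-form knee could be derived per branch, but that is not attempted here.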

When a parallel job becomes the first job in the waiting queue and its originally specified number of processors is larger than the number of free processors, the system does not keep it waiting in the queue but automatically scales the job down to use exactly the number of free processors. On the other hand, if the number of free processors is larger than the job's specified number, the system automatically scales the job up to the minimum of the total number of free processors and the optimal value determined by calculating the knee based on the job's speedup model. Moreover, to prevent the scaling up of a job from delaying the start time of the following jobs in the queue, the auxiliary moldable job scheduling approach scales a job up only if there are no jobs behind it in the queue.

Figure 3.3 is an example illustrating the advantage of moldable job scheduling over traditional rigid job scheduling. In the left part of the figure, since rigid job scheduling tries to allocate to each parallel job exactly the number of processors specified upon job submission, task III cannot get enough processors to start its execution at the beginning, resulting in a worse average turnaround time for all three jobs and a worse resource utilization rate. In the right part of the figure, moldable job scheduling scales task III down to use only 20 processors instead of the 30 originally specified. This arrangement allows task III to start its execution earlier with fewer processors. Although task III needs more time to finish its execution in this arrangement, the total turnaround time is actually reduced since the waiting time of task III decreases to zero. This comparative example demonstrates the potential advantage of moldable job scheduling.
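The allocation rule of the auxiliary approach described above can be sketched as a single decision function. This is a minimal illustration rather than the thesis's implementation: the function name and parameters are hypothetical, the knee is passed in as a precomputed integer, and two details are assumptions labeled in the comments (a job waits when no processor is free, and scaling up never goes below the requested number).

```python
def auxiliary_allocation(requested, free, knee, jobs_behind):
    """Processors to give the job at the head of the waiting queue.

    requested   -- processors the user specified at submission
    free        -- processors currently idle
    knee        -- knee value of the job's speedup model (precomputed)
    jobs_behind -- number of jobs waiting behind this one
    Returns 0 when the job cannot start yet.
    """
    if free <= 0:
        # Assumption: with no free processor the job simply keeps waiting.
        return 0
    if requested > free:
        # Scale down to exactly the number of free processors.
        return free
    if jobs_behind == 0:
        # Scale up to min(free, knee), but only when nobody is waiting.
        # Assumption: never drop below the request while scaling up.
        return max(requested, min(free, knee))
    # Other jobs are waiting: keep the requested size.
    return requested


if __name__ == "__main__":
    # Mirrors the Figure 3.3 discussion: 30 requested, 20 free.
    print(auxiliary_allocation(requested=30, free=20, knee=25, jobs_behind=2))
```

Passing the knee in as a parameter keeps the decision rule independent of any particular speedup model.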

Figure 3.3: Advantage of moldable job scheduling

Figure 3.4 is another example, illustrating the advantage of our auxiliary moldable job scheduling approach based on the concept of the knee over the greedy approaches in existing moldable job scheduling methods [6][11]. In the left part of the figure, since the greedy approach tries to scale a job's parallelism up as far as possible, task I gets more processors for its execution. This, in turn, delays the start time of tasks II and III because too few processors are available at the beginning, resulting in a worse average turnaround time for all three jobs and a worse resource utilization rate. In the right part of the figure, our auxiliary moldable job scheduling approach limits each job's maximum parallelism to its knee value. Task I therefore consumes fewer processors, allowing task II to start its execution at the beginning and resulting in a shorter average turnaround time for all three jobs. This comparative example demonstrates the potential advantage of our auxiliary moldable job scheduling approach.

Figure 3.4: Advantage of our auxiliary moldable job scheduling

3.2 Moldable Job Scheduling for HPCaaS

The usage scenario in the previous section requires users to specify a preferred number of processors upon job submission. This is tedious, and the right value is difficult to determine, for HPCaaS users who just want to run their application services as quickly as possible and do not have enough knowledge of the underlying parallel structure of the application services developed by the HPCaaS providers. In this section, we propose a moldable job scheduling approach for HPCaaS which automatically selects the most appropriate number of processors for each job, benefiting both the job's turnaround time and the resource utilization rate that HPCaaS providers care about.

Without processor requirement information from users, the job scheduler has, in general, two possible directions in making processor allocation decisions. The first is to allow as many jobs in the queue as possible to run immediately. Each job then gets only a small portion of the free processors for its execution, resulting in a higher degree of concurrency among jobs. The alternative is to give the first job in the queue as many processors as possible, allowing that job to run faster but decreasing the degree of concurrency among jobs. We call these two directions the parallel policy and the serial policy, respectively. Which policy is better depends largely on the parallel behavior of the applications. Therefore, in the following we investigate the effects of these two allocation policies under three commonly used parallel speedup models.

The three parallel speedup models considered in this section cover the behavior of most parallel applications. The first is the model usually introduced in textbooks on parallel processing [27], called the linear speedup model hereafter in this thesis, where speedup is defined by $S_p = T_1 / T_p$, with p the number of processors, $T_1$ the execution time of the sequential run, and $T_p$ the execution time of parallel processing with p processors. Based on this definition of speedup, efficiency is another performance metric, defined as $E_p = S_p / p = T_1 / (p\,T_p)$. Efficiency is a value, typically between zero and one, that estimates how well the processors are utilized in solving the problem. The second model considered in this section is Amdahl's law [13], which states that if P is the proportion of a program that can be made parallel, then the maximum speedup that can be achieved by using N processors is $S(N) = 1 / ((1 - P) + P/N)$. The third is Downey's speedup model of parallel programs, which has been shown capable of representing the parallelism and speedup characteristics of many real parallel applications [14][15]. Downey's model is a non-linear function of two parameters, which were described in detail in the previous section.

Based on the speedup models, the resulting average turnaround time of the two allocation policies can be derived. Equations (6) to (19) give the average turnaround time achieved by the parallel policy and the serial policy for the linear speedup model, the Amdahl's law model, and Downey's high-variance and low-variance models, respectively, where t is the job's runtime, x is the parallel proportion between 0 and 1, n is the number of free processors, and d is the number of jobs in the queue. For simplicity, n is assumed to be a multiple of d.

Equations (6) and (7) are for the linear speedup model, and equations (8) and (9) are for the Amdahl’s law model. Equations (10) to (13) are for Downey’s high-variance model, and equations (14) to (19) are for Downey’s low-variance model, each over different ranges of $n$:

Downey’s high-variance model: [ $1 \le n \le A + A\sigma - \sigma$ ], [ $n \ge A + A\sigma - \sigma$ ]
Downey’s low-variance model: [ $1 \le n \le A$ ], [ $A \le n \le 2A - 1$ ], [ $n \ge 2A - 1$ ]
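The comparison between the two allocation policies can be made concrete with a small sketch. It is not a reproduction of equations (6) to (19). It assumes that all d jobs are identical, that they are all queued at time zero, that t is the sequential runtime, and that the parallel policy splits the n free processors equally. The function names are hypothetical, and the Downey formulas follow the commonly cited form of the model in [14] rather than this thesis’s own statement of it.

```python
"""Compare the parallel and serial allocation policies for d identical jobs."""


def linear_speedup(n):
    """Linear model: n processors give a speedup of n."""
    return float(n)


def amdahl_speedup(n, x):
    """Amdahl's law with parallel proportion x (0 <= x <= 1)."""
    return 1.0 / ((1.0 - x) + x / n)


def downey_speedup(n, A, sigma):
    """Downey's model in its commonly cited form (an assumption here)."""
    if sigma <= 1.0:  # low-variance branch
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2.0)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1.0 - sigma / 2.0))
        return float(A)
    # high-variance branch
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1.0) / (sigma * (n + A - 1.0) + A)
    return float(A)


def parallel_policy(t, n, d, speedup):
    """All d jobs start at time zero on n/d processors each."""
    return t / speedup(n // d)


def serial_policy(t, n, d, speedup):
    """Jobs run one after another, each on all n processors."""
    run = t / speedup(n)
    # The k-th job finishes at k * run; average over the d jobs.
    return sum(k * run for k in range(1, d + 1)) / d


if __name__ == "__main__":
    t, n, d = 100.0, 16, 4  # illustrative values only
    models = {
        "linear": linear_speedup,
        "Amdahl (x=0.9)": lambda p: amdahl_speedup(p, 0.9),
        "Downey high-variance (A=8, sigma=2)": lambda p: downey_speedup(p, 8, 2.0),
        "Downey low-variance (A=8, sigma=0.5)": lambda p: downey_speedup(p, 8, 0.5),
    }
    for name, s in models.items():
        print(name, parallel_policy(t, n, d, s), serial_policy(t, n, d, s))
```

Under these assumptions, the serial policy yields the lower average turnaround time for the linear model, since (d + 1)t/(2n) is smaller than dt/n whenever d > 1, while for the sublinear models the outcome depends on the chosen parameters.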

Figures 3.5 to 3.8 compare the parallel and serial allocation policies, in terms of average turnaround time, under the different application speedup models. The comparison indicates that the job scheduler has to adopt different processor allocation policies for applications with different speedup models. For example, the serial allocation policy is superior for applications following the linear speedup model, while the parallel allocation policy achieves better performance for the other models. Based on this analysis, we developed a moldable job scheduling approach for HPCaaS that automatically determines the amount of processors to use on behalf of HPC users. It not only relieves users of the burden of specifying appropriate numbers of processors but also achieves better system performance than existing job scheduling methods. The proposed approach is evaluated in Chapter 5.

Figure 3.5: Linear speedup model

Figure 3.6: Amdahl’s law speedup model

Figure 3.7: Downey’s high-variance model

Figure 3.8: Downey’s low-variance model

Figure 3.9 is an example based on Downey’s high-variance model, illustrating the benefits of the proposed moldable job scheduling approach for HPCaaS by comparing it with the auxiliary moldable job scheduling approach proposed in the previous section. In the lower part of the figure, the auxiliary moldable job scheduling approach tends to give task I as many processors as possible, provided that the amount of processors allocated does not exceed its knee value. However, this arrangement delays the start of task III, resulting in a longer average turnaround time. In the upper part of the figure, on the other hand, the moldable job scheduling approach for HPCaaS tends to run the three tasks simultaneously, since the parallel allocation policy is the better choice for Downey’s high-variance model as shown in Figure 3.7. This leads to a shorter average turnaround time than in the lower part. The comparative example demonstrates the potential superiority of the proposed moldable job scheduling for HPCaaS.

Figure 3.9: Comparison of auxiliary moldable job scheduling and moldable job scheduling for HPCaaS

Chapter 4. Moldable Job Scheduling with Runtime Information

The approaches discussed in the previous chapter fall into category III in Figure 1.1 and do not require the runtime information of submitted jobs. This chapter explores moldable job scheduling for the case in Figure 1.1 where job runtime information is available, and proposes a new moldable job scheduling approach which takes advantage of the runtime information of submitted jobs to make better processor allocation decisions. Jobs’ runtime information has been used in many parallel job scheduling methods, e.g. backfilling job scheduling methods for rigid jobs [19][20]. In such scheduling systems, users are required to provide job runtime information upon job submission. The job runtime information is then used to guide advanced non-FCFS (First-Come-First-Serve) job scheduling through job backfilling. In this chapter, the job runtime information is used to make better processor allocation decisions for moldable job scheduling by increasing the resource utilization rate as much as possible.

4.1 Previous Moldable Job Scheduling with Runtime Information

Several previous submit-time or schedule-time moldable job scheduling methods [9][10] have adopted job runtime information to improve job scheduling performance. In these methods, job runtime information is mainly used to derive a maximum number of processors for each job based on the notion of fair sharing. The proposed fair-sharing schemes hold that each job should be given a limit on the maximum allowable amount of processors, and that the limit should depend on what fraction a job constitutes of the total weight of the jobs currently in the system, including both running and waiting jobs [10]. Several schemes for calculating the limit have therefore been proposed in previous research [9][10]. In [10], a scheme was proposed for calculating the maximum allowable amount of processors for each job as follows.

Weight fraction of job $i$ = $\dfrac{\mathrm{Weight}_i}{\sum_{j} \mathrm{Weight}_j}$

where Weight refers to the effective sequential runtime of a job, and all currently running and queued jobs in the system are considered. In [9], Srinivasan et al. proposed another, modified weight-fraction formula as follows.

Modified weight fraction of job $i$ = $\dfrac{\sqrt{\mathrm{Weight}_i}}{\sum_{j} \sqrt{\mathrm{Weight}_j}}$

Given two jobs, one having 4 times the weight of the other, the idea is to give the large job twice as many processors as the small job, with the effect that it also has twice the runtime of the small job. Thus, relative to the small job, the large job is reshaped to spread out roughly equally along both the space and time dimensions. However, even though each sequential job can take at most 1 processor, the modified weight-fraction formula would set aside a proportional amount of processors even for such jobs. To correct this inconsistency, they ignored the sequential jobs while calculating the weight fraction, and the corrected modified strategy improved performance considerably more in their study [9]. The corrected modified weight-fraction formula is:

Corrected modified weight fraction of job $i$ = $\dfrac{\sqrt{\mathrm{Weight}_i}}{\sum_{j \in \text{parallel jobs}} \sqrt{\mathrm{Weight}_j}}$

Based on the above corrected modified weight fraction calculation, a schedule-time combined moldable job scheduling strategy was proposed and shown to outperform previous submit-time and schedule-time moldable job scheduling approaches in [10][24].

4.2 Our Moldable Job Scheduling with Runtime Information for HPCaaS

In this section, we extend the moldable job scheduling for HPCaaS approach to take advantage of job runtime information and further improve the overall system performance. Figure 4.1 is an example showing the potential inefficiency of the original moldable job scheduling for HPCaaS approach when applying the parallel allocation policy. As shown in the figure, the parallel allocation policy tends to run all the jobs in the queue, tasks I, II, and III, simultaneously. However, if the waiting queue remains empty until some of the running jobs finish their execution, the released resources become idle and the resource utilization rate degrades. This arrangement also leads to longer execution time for tasks I, II, and III, since each of them is allocated fewer processors than are available.
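As a side illustration of the fair-sharing schemes reviewed in Section 4.1, the three weight-fraction variants can be sketched as follows. The sketch assumes that the plain fraction is a job’s weight over the total weight, that the modified variant takes square roots of the weights (which matches the example where a job with 4 times the weight receives twice the processors), and that the corrected variant leaves sequential jobs out of the calculation. The function names and the choice to return a fraction of zero for sequential jobs are hypothetical and are not taken from [9] or [10].

```python
import math


def weight_fractions(weights):
    """Plain weight fraction: each job's share of the total weight."""
    total = sum(weights)
    return [w / total for w in weights]


def modified_weight_fractions(weights):
    """Modified variant: shares are proportional to sqrt(weight)."""
    roots = [math.sqrt(w) for w in weights]
    total = sum(roots)
    return [r / total for r in roots]


def corrected_modified_weight_fractions(weights, is_sequential):
    """Corrected variant: sequential jobs are ignored in the calculation.

    Assumption of this sketch: a sequential job is reported with a
    fraction of 0.0, because it uses a single processor regardless.
    """
    roots = [0.0 if seq else math.sqrt(w)
             for w, seq in zip(weights, is_sequential)]
    total = sum(roots)
    if total == 0.0:
        return [0.0] * len(roots)
    return [r / total for r in roots]


if __name__ == "__main__":
    weights = [400.0, 100.0]  # one job has 4 times the weight of the other
    print(weight_fractions(weights))
    print(modified_weight_fractions(weights))
    print(corrected_modified_weight_fractions(
        [400.0, 100.0, 50.0], [False, False, True]))
```

With weights of 400 and 100, the plain fractions are in a 4:1 ratio, while the modified fractions are in a 2:1 ratio, which is the reshaping along both space and time that the text describes.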

Figure 4.1: The original moldable job scheduling for HPCaaS approach when applying the parallel allocation policy

To avoid the potential inefficiency illustrated in Figure 4.1, we propose an improved moldable job scheduling approach for HPCaaS, which takes advantage of job runtime information to improve overall system performance by increasing the resource utilization rate. The proposed approach works as follows. It maintains two queue structures recording information about running and waiting jobs, respectively. On each processor allocation decision, it first scans the running queue to calculate the number of possible future resource releases resulting from the completion of running jobs, based on the job runtime information. Then, it counts the number of jobs in the waiting queue. If the number of waiting jobs is not larger than the number of possible future resource releases, it switches from the parallel allocation policy to the serial allocation policy and allocates only the first job in the waiting queue for execution. On the other hand, if the number of waiting jobs is larger than the number of possible future resource releases, it keeps the parallel allocation policy, but allocates only the first n jobs in the waiting queue for execution. The value of n is calculated by subtracting the number of running jobs from the number of waiting jobs. In this way, the improved moldable job scheduling approach raises the resource utilization rate as much as possible, and is expected to result in a shorter average turnaround time for all jobs.

Figure 4.2: Advantage of the moldable job scheduling with runtime information for HPCaaS

Figure 4.2 is another example showing the benefits of the improved moldable job scheduling approach. Compared to Figure 4.1, tasks II and III are no longer allocated at the same time as task I. Instead, their allocations are delayed until some running jobs finish their execution. In this way, although their waiting time increases, they are allocated more processors for execution than in Figure 4.1, resulting in a shorter turnaround time on average. Figure 4.3 shows the detailed algorithm of the improved moldable job scheduling approach.

Algorithm: Improved moldable job scheduling for HPCaaS
Attributes:
    waitQueue.Length: number of waiting jobs
    runQueue.Length: number of running jobs
    numRD: number of jobs to be allocated

numRD = waitQueue.Length - runQueue.Length;
if (numRD > 0)
{
    waitQueue.deQueue(first numRD jobs);
    allocate the numRD jobs using the parallel allocation policy;
    runQueue.enQueue(these numRD allocated jobs);
}
else if (waitQueue.Length > 0)
{
    waitQueue.deQueue(first job);
    allocate the job with all current free processors;
    runQueue.enQueue(the allocated job);
}

Figure 4.3: Algorithm of the improved moldable job scheduling for HPCaaS
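The algorithm in Figure 4.3 can be sketched in runnable form as follows. This is a minimal illustration rather than the simulator used in this thesis. It assumes that the parallel allocation policy splits the free processors equally among the selected jobs, as the equal-partition strategy mentioned in Chapter 6 suggests. The function name `allocate`, the handling of leftover processors, and the cap that guarantees at least one processor per job are hypothetical additions.

```python
from collections import deque


def allocate(wait_queue, run_queue, free_procs):
    """Make one allocation decision and return [(job, processors), ...].

    wait_queue and run_queue are deques of job identifiers; both are
    updated in place, mirroring the deQueue/enQueue steps of Figure 4.3.
    """
    started = []
    if free_procs <= 0 or not wait_queue:
        return started

    num_rd = len(wait_queue) - len(run_queue)
    if num_rd > 0:
        # Parallel allocation policy for the first num_rd waiting jobs.
        # Assumption: never start more jobs than there are free processors.
        num_rd = min(num_rd, free_procs)
        share, extra = divmod(free_procs, num_rd)
        for i in range(num_rd):
            job = wait_queue.popleft()
            # Assumption: leftover processors go to the earliest jobs.
            started.append((job, share + (1 if i < extra else 0)))
    else:
        # Serial allocation policy: the first job takes every free processor.
        job = wait_queue.popleft()
        started.append((job, free_procs))

    run_queue.extend(job for job, _ in started)
    return started


if __name__ == "__main__":
    waiting = deque(["I", "II", "III"])
    running = deque()
    print(allocate(waiting, running, 12))

    waiting = deque(["IV", "V"])
    running = deque(["I", "II"])
    print(allocate(waiting, running, 8))
```

In the first call no job is running, so all three waiting jobs start under the parallel policy. In the second call the waiting jobs do not outnumber the running ones, so only the first waiting job starts and it receives all free processors.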

Chapter 5. Experiments and Performance Evaluation

This chapter evaluates the moldable job scheduling approaches proposed in Chapters 3 and 4, and compares them with previous methods in [6][9][10][11]. The performance evaluation was conducted through a series of simulation experiments, assuming a 128-processor cluster, based on a public workload log from SDSC’s SP2 [17]. The workload log contains 73496 records collected on a 128-node IBM SP2 machine at the San Diego Supercomputer Center (SDSC) from May 1998 to April 2000. After excluding some problematic records based on the completed field [17] in the log, the simulation experiments in this thesis use 56490 job records as the input workload. The two parameters of Downey’s speedup models, σ and A, were generated randomly. Moreover, we introduced an integer load factor in the simulation experiments, which generates different levels of system load by simply multiplying the original job runtime recorded in the workload log by the factor.

Figures 5.1 and 5.2 evaluate the proposed auxiliary moldable job scheduling approach and compare it with two previous greedy methods, LSF [6] and the adaptive processor allocation heuristics in [11], under Downey’s two parallel speedup models. The experimental results indicate that the proposed auxiliary moldable job scheduling approach greatly outperforms the two existing methods across different levels of system load, achieving up to 83% performance improvement in terms of average turnaround time, compared to LSF.
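The workload preparation described at the start of this chapter can be sketched as below. The sketch assumes that the log is in the Standard Workload Format used by the Parallel Workloads Archive [17], where comment lines start with a semicolon and each record has 18 whitespace-separated fields, with the submit time in field 2, the run time in field 4, and the status in field 11. It further assumes that the "completed field" corresponds to a status value of 1. The exact filtering rule used in this thesis is not stated, so the rule below and the function names are hypothetical.

```python
def load_jobs(lines, load_factor=1):
    """Parse workload records, drop problematic ones, and scale runtimes."""
    jobs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and header comments
        fields = line.split()
        job_id = int(fields[0])
        submit = int(fields[1])
        runtime = int(fields[3])
        status = int(fields[10])
        # Assumption: keep only completed jobs with a positive runtime.
        if status != 1 or runtime <= 0:
            continue
        jobs.append({
            "id": job_id,
            "submit": submit,
            # The integer load factor multiplies the recorded runtime.
            "runtime": runtime * load_factor,
        })
    return jobs


def average_turnaround(jobs, finish_times):
    """Mean of (finish time - submit time) over all jobs."""
    total = sum(finish_times[j["id"]] - j["submit"] for j in jobs)
    return total / len(jobs)


if __name__ == "__main__":
    sample = [
        "; made-up records for illustration only",
        "1 0 10 100 8 -1 -1 8 200 -1 1 1 1 -1 1 -1 -1 -1",
        "2 5 0 50 4 -1 -1 4 60 -1 0 1 1 -1 1 -1 -1 -1",
        "3 9 2 30 16 -1 -1 16 40 -1 1 2 1 -1 1 -1 -1 -1",
    ]
    jobs = load_jobs(sample, load_factor=2)
    print(jobs)
    print(average_turnaround(jobs, {1: 210, 3: 71}))
```

The sample records are invented for the illustration and do not come from the SDSC SP2 log.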

Figure 5.1: Evaluation of the auxiliary moldable job scheduling approach with Downey’s low-variance speedup model

Figure 5.2: Evaluation of the auxiliary moldable job scheduling approach with Downey’s high-variance speedup model

Figures 5.3 and 5.4 evaluate the proposed moldable job scheduling for HPCaaS approach and compare it with the previous auxiliary moldable job scheduling approach using Downey’s two parallel speedup models. The experimental results indicate that the proposed moldable job scheduling for HPCaaS approach not only relieves users of the burden of specifying an appropriate amount of processors upon job submission, but also significantly outperforms the previous method across different levels of system load, achieving up to 78% performance improvement in terms of average turnaround time.

Figure 5.3: Evaluation of moldable job scheduling for HPCaaS with Downey’s low-variance speedup model

Figure 5.4: Evaluation of moldable job scheduling for HPCaaS with Downey’s high-variance speedup model

Figures 5.5, 5.6 and 5.7 evaluate our moldable job scheduling approach with runtime information, comparing it with the combined moldable job scheduling strategy in [9] and the previous moldable job scheduling for HPCaaS approach, based on Downey’s two parallel speedup models and the Amdahl’s law speedup model. The experimental results indicate that our moldable job scheduling approach with runtime information significantly outperforms the previous methods across different levels of system load, achieving up to 89% performance improvement in terms of average turnaround time.

Figure 5.5: Evaluation of moldable job scheduling with runtime information using Downey’s low-variance speedup model

Figure 5.6: Evaluation of moldable job scheduling with runtime information using Downey’s high-variance speedup model

Figure 5.7: Evaluation of moldable job scheduling with runtime information using Amdahl’s law speedup model

Chapter 6. Conclusions and Future Work

Traditional supercomputing centers usually adopt rigid job scheduling, which requires users to specify the amount of processors to use upon job submission and then allocates computing resources to each job according to the specified processor requirement. However, if the specified amount of processors is larger than the currently available resources, the job has to wait while the available resources are kept idle, which degrades resource utilization and job turnaround time. Most modern parallel applications, e.g. those written with MPI, have the moldable property, which allows them to exploit different degrees of parallelism for execution at runtime. In such cases, traditional rigid job scheduling is inappropriate. Moldable job scheduling has therefore become an important research topic, aiming to improve the overall system performance through adaptive processor allocation that takes advantage of the moldable property. Moreover, as cloud computing emerges, the concept of HPC as a Service (HPCaaS) was recently proposed to transform HPC facilities and applications into a more convenient and accessible service model. HPCaaS users simply want to get their jobs done as soon as possible, and they either do not want to specify an appropriate amount of processors for the application’s execution or have no idea how to do so. Moldable job scheduling can therefore contribute to HPCaaS in two important aspects. Firstly, it relieves users of the burden of selecting an appropriate number of processors upon job submission, leading to a much easier and more convenient user experience for HPCaaS. Secondly, moldable job scheduling has the potential to improve the average turnaround time of parallel applications and the overall resource utilization, benefiting both HPCaaS users and providers.

In this thesis, we classify previous research on moldable job scheduling into four quadrants according to two aspects: whether the decision is made at submit time or schedule time, and whether job runtime information is available. Based on this classification, we make three contributions to moldable job scheduling by taking advantage of information about applications’ speedup models. The first contribution, called auxiliary moldable job scheduling, is a feasible extension to the usage model in most current HPC centers. It improves the overall system performance by limiting the maximum allowable amount of processors for each job according to the knee value calculated from applications’ speedup models. In the second contribution, we propose a moldable job scheduling approach for HPCaaS, which automatically selects the most appropriate amount of processors for a job’s execution based on applications’ speedup models and the workload conditions at the moment, relieving users of the burden of selecting an appropriate number of processors upon job submission. The approaches in these two contributions do not require job runtime information. In the third contribution, we propose an advanced moldable job scheduling approach that takes advantage of job runtime information to further improve the overall system performance. The proposed moldable job scheduling approaches were evaluated through a series of simulation experiments and compared to previous methods in the literature. The experimental results indicate that our approaches significantly outperform existing methods, achieving up to 83%, 78%, and 89% performance improvement in terms of average turnaround time, respectively.

Based on the results and experience in this thesis, two research directions are worthy of further exploration. The first is evaluating the effects of inaccurate runtime estimation on the proposed moldable job scheduling methods, since job runtime plays an important role in allocation decisions. Although accurate runtime estimation is more probable in HPCaaS usage scenarios than on traditional HPC platforms, 100% accuracy is still not possible. Therefore, it is desirable to evaluate the influence of runtime estimation accuracy. The second future research direction is to extend the proposed moldable job scheduling approach, when the parallel allocation policy is adopted, by exploring different resource partitioning strategies. In this thesis, a simple equal-partition strategy is used when the parallel allocation policy is applied, which means that each of the n jobs in the queue is allocated an equal 1/n share of the resources. Other possible partitioning strategies that consider the workload differences among the jobs are worthy of further exploration and evaluation.

References

[1] D. G. Feitelson, “A Survey of Scheduling in Multiprogrammed Parallel Systems”, Research Report RC 19790 (87657), IBM T. J. Watson Research Center, Oct. 1994.
[2] R. Gibbons, “A Historical Application Profiler for Use by Parallel Schedulers”, Proc. Job Scheduling Strategies for Parallel Processing, pp. 58-77, Springer-Verlag, 1997.
[3] D. Lifka, “The ANL/IBM SP Scheduling System”, Proc. Job Scheduling Strategies for Parallel Processing, pp. 295-303, Springer-Verlag, 1995.
[4] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. Sevcik, and P. Wong, “Theory and Practice in Parallel Job Scheduling”, Proc. Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (Eds.), Lecture Notes in Computer Science, Vol. 1291, pp. 1-34, Springer-Verlag, 1997.
[5] The Message Passing Interface standard, http://www.mcs.anl.gov/research/projects/mpi/ (June 2013)
[6] Load sharing facility, http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/products/lsf/ (June 2013)
[7] M. AbdelBaky, M. Parashar, H. Kim, K. E. Jordan, V. Sachdeva, J. Sexton, H. Jamjoom, Z. Y. Shae, G. Pencheva, R. Tavakoli, and M. F. Wheeler, “Enabling High Performance Computing as a Service”, IEEE Computer, Vol. 45, pp. 72-80, IEEE Press, Oct. 2012.
[8] Cloud computing, http://www.infoworld.com/d/cloud-computing/what-cloud-computing-really-means-031 (Mar 2012)
[9] S. Srinivasan, S. Krishnamoorthy, and P. Sadayappan, “A Robust Scheduling Strategy for Moldable Scheduling of Parallel Jobs”, Proc. 5th IEEE International Conference on Cluster Computing, pp. 92-99, 2003.
[10] S. Srinivasan, V. Subramani, R. Kettimuthu, P. Holenarsipur, and P. Sadayappan, “Effective Selection of Partition Sizes for Moldable Scheduling of Parallel Jobs”, Proc. 9th International Conference on High Performance Computing, Bangalore, India, Lecture Notes in Computer Science, Vol. 2552, pp. 174-183, Springer, 2002.
[11] K. C. Huang, “Performance Evaluation of Adaptive Processor Allocation Policies for Moldable Parallel Batch Jobs”, Proc. 3rd Workshop on Grid Technologies and Applications, Dec. 2006.
[12] D. L. Eager, J. Zahorjan, and E. D. Lazowska, “Speedup Versus Efficiency in Parallel Systems”, IEEE Transactions on Computers, Vol. 38, Issue 3, pp. 408-423, March 1989.
[13] L. Kleinrock and J. H. Huang, “On Parallel Processing Systems: Amdahl’s Law Generalized and Some Results on Optimal Design”, IEEE Transactions on Software Engineering, Vol. 18, Issue 5, 1992.
[14] A. B. Downey, “A Model for Speedup of Parallel Programs”, UC Berkeley EECS Technical Report, No. UCB/CSD-97-933, January 1997.
[15] A. B. Downey, “A Parallel Workload Model and Its Implications for Processor Allocation”, Proc. 6th International Symposium on High Performance Distributed Computing, 1997.
[16] NAS parallel benchmarks, http://www.nas.nasa.gov/publications/npb.html (Jan 2012)
[17] Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload/ (Jan 2012)
[18] M. W. Hall and M. Martonosi, “Adaptive Parallelism in Compiler-Parallelized Code”, Concurrency: Practice and Experience, Vol. 10, Issue 14, pp. 1235-1250, 1998.
[19] D. G. Feitelson and A. M. Weil, “Utilization and Predictability in Scheduling the IBM SP2 with Backfilling”, Proc. 12th Int’l Parallel Processing Symp., pp. 542-546, Apr. 1998.
[20] A. W. Mu’alem and D. G. Feitelson, “Utilization, Predictability, Workloads, and User Runtime Estimate in Scheduling the IBM SP2 with Backfilling”, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, Issue 6, pp. 529-543, June 2001.
[21] H. Sun, Y. Cao, and W. J. Hsu, “Efficient Adaptive Scheduling of Multiprocessors with Stable Parallelism Feedback”, IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 4, April 2011.
[22] S. Ioannidis, U. Rencuzogullari, R. Stets, and S. Dwarkadas, “CRAUL: Compiler and Run-Time Integration for Adaptation Under Load”, Journal of Scientific Programming, Aug. 1999.
[23] J. Pruyne and M. Livny, “Parallel Processing on Dynamic Resources with CARMI”, Proc. Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (Eds.), Vol. 949, pp. 259-278, Springer, 1995.
[24] W. Cirne and F. Berman, “Using Moldability to Improve the Performance of Supercomputer Jobs”, Journal of Parallel and Distributed Computing, Vol. 62, pp. 1571-1601, Oct. 2002.
[25] W. Cirne and F. Berman, “Adaptive Selection of Partition Size for Supercomputer Requests”, Proc. Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, Vol. 1911, pp. 187-207, 2000.
[26] G. Sabin, M. Lang, and P. Sadayappan, “Moldable Parallel Job Scheduling Using Job Efficiency: An Iterative Approach”, Proc. Job Scheduling Strategies for Parallel Processing, Saint Malo, France, June 2006.
[27] T. G. Lewis and H. El-Rewini, Introduction to Parallel Computing, Prentice-Hall International, 1992.
