Task Allocation in Workflow Scheduling on Parallel Computing Platform


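Master's Thesis, Graduate Program, Department of Computer Science, National Taichung University of Education (國立臺中教育大學資訊工程學系)
Title: Task Allocation in Workflow Scheduling on Parallel Computing Platform (平行計算平台上工作流程排程問題中資源配置方法之研究)
Advisor: Dr. 黃國展
Author: 蔡英麟
July 2013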

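Acknowledgements

For the completion of this thesis, I would first like to thank my advisor, Professor 黃國展. I am grateful for the time and energy he devoted to supervising this thesis, which allowed me to complete it from scratch. Beyond thesis supervision, he also taught me how to treat people and handle matters. His patient guidance gave me a deeper understanding of the weaknesses in my own character and of how to overcome them. He is not only my teacher but also an important mentor in life.

I would also like to thank all the senior and junior schoolmates and classmates who have helped me: 和展 and 迪萱, the senior schoolmates who helped me find my topic and direction; my classmate 佑佑, 博仁, and senior schoolmate 阿搞, who always answered my questions and helped me solve all kinds of problems; the team who helped one another through projects and oral defenses, namely classmates 則齊 and 柏均, junior schoolmates 謝瑋 and 俊豪, and junior schoolmates 沐容 and 曉青; my systems-group partners 小桂, 麒璋, 智忠, 昱汶, 建佑, and 孟儒, with whom I headed to the canyon to shout and let off steam when the pressure grew too great; my software-engineering partners 柏寰, 紹源, and others; my networking partners 建豪 and 珊姐; and my information-security partners 育彰 and others. There are too many to list; thank you all for looking after me.

Finally, I thank my parents and my girlfriend for your support and for everything you have given me. I dedicate this thesis to you, whom I love most, and share my joy with you.

蔡英麟, 2013.07.12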

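Abstract (translated from the Chinese abstract, 摘要)

With advances in parallel processing technology and the emergence of new platforms such as grid and cloud computing, more and more large-scale scientific and engineering applications adopt the workflow model to express their computational structure and the dependencies and data-transfer relationships among their internal computation modules. Workflow scheduling has therefore become increasingly important, and many scheduling methods have been proposed and discussed in the literature. List-based and clustering-based approaches are the two main classes of workflow scheduling methods. This thesis makes two main contributions, proposing improved methods for each of these two classes. The first part targets the list-based model and proposes new methods for determining task priority and for allocating computing resources. The second part targets clustering-based scheduling and, by considering both the fitness of computing resources and the completion time of task groups, proposes new task group allocation methods that can handle the scheduling of multiple concurrent workflows. The proposed methods were evaluated in detail through a series of simulation experiments and compared with typical, commonly used workflow scheduling methods. The experimental results indicate that our methods significantly improve workflow execution performance over the commonly used methods: for the list-based and clustering-based approaches, the workflow completion time is reduced by up to 11.8% and 15.5% on average, respectively.

Keywords: workflow scheduling, list-based approach, clustering-based approach, task priority ranking, computing resource allocation.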

Abstract

With the advancement of technology and the emergence of grid and cloud computing, many large-scale scientific and engineering applications are now constructed as workflows because they involve large amounts of interrelated computation and communication. Many approaches have been proposed to deal with the challenging workflow scheduling problem. List scheduling and clustering are the two most common types of workflow scheduling heuristics. In this thesis, we make contributions to both types of workflow scheduling. In the first part, we develop new task ranking and allocation methods for list-based scheduling approaches. In the second part, we propose efficient task group allocation methods, considering both resource fitness and tasks' EFT (Earliest Finish Time), for clustering-based concurrent workflow scheduling. The proposed approaches were evaluated through a series of simulation experiments and compared to typical workflow scheduling methods. The experimental results show that our approaches outperform existing methods significantly, achieving up to 11.8% and 15.5% performance improvement in terms of average makespan for list-based and clustering-based approaches, respectively.

Keywords: workflow scheduling, list-based scheduling, clustering-based scheduling, task ranking, task allocation.

Table of Contents

Acknowledgements (誌謝)
Abstract in Chinese (摘要)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Task Ranking and Allocation in List-Based Workflow Scheduling
    3.1 Task Ranking
    3.2 Task Allocation
Chapter 4. Task Group Allocation in Clustering-Based Multiple Workflow Scheduling
    4.1 Task Group Allocation for Continuous Gap Search
    4.2 Task Group Allocation for Distributed Gap Search
Chapter 5. Simulation Environment
    5.1 Main System Components
    5.2 Classes in the Simulator
Chapter 6. Experiments and Performance Evaluation
    6.1 Experimental Setting
    6.2 Task Ranking and Allocation in List-Based Workflow Scheduling
    6.3 Task Group Allocation in Clustering-Based Multiple Workflows Scheduling
Chapter 7. Conclusions and Future Work
References

List of Figures

Figure 1.1: General DAG-based workflow
Figure 3.1: A workflow example of fork-join structure
Figure 3.2: An example of general structure workflow
Figure 3.3: The resultant schedules according to the two different task ranking methods: (a) bottom ranking (b) top + bottom ranking
Figure 3.4: Example of fork-join structure workflow
Figure 3.5: The resultant schedules according to the three different task ranking methods: (a) bottom ranking (b) top + bottom ranking (c) bottom amount
Figure 3.6: (a) Example workflow (b) schedule by EFT (c) schedule by FST
Figure 4.1: A task clustering example
Figure 4.2: (a) An example of two concurrent workflows; (b) list-based scheduling result; (c) clustering-based scheduling result
Figure 4.3: Example of clustering-based best-fit allocation
Figure 4.4: Example of enhanced best-fit task group allocation
Figure 4.5: Algorithmic description of adjustable best-fit task group allocation
Figure 4.6: The distributed gap search scheme by Jiang et al. [4]
Figure 4.7: Gap evaluation in adaptive distributed gap search
Figure 6.1: Different task ranking methods for general DAGs (HEFT)
Figure 6.2: Different task ranking methods for fork-join DAGs (HEFT)
Figure 6.3: FST vs. EFT for general DAGs
Figure 6.4: FST vs. EFT for fork-join DAGs
Figure 6.5: (a) Example workflow (b) schedule by EFT on 3 resources (c) schedule by FST on 3 resources (d) schedule by EFT on 6 resources (e) schedule by FST on 6 resources
Figure 6.6: Effects of different numbers of branches in fork-join DAGs
Figure 6.7: Effects of branch length in fork-join DAGs
Figure 6.8: Effects of different numbers of branches in fork-join DAGs
Figure 6.9: Effects of branch length in fork-join DAGs
Figure 6.10: Evaluation of the integrated approach for general DAGs
Figure 6.11: Evaluation of the integrated approach for fork-join DAGs
Figure 6.12: Evaluation of the integrated approach with Montage
Figure 6.13: Evaluation of the integrated approach with LIGO
Figure 6.14: Montage
Figure 6.15: LIGO
Figure 6.16: 100 workflows on 30-resource homogeneous system with continuous task group allocation (CCR=0.1)
Figure 6.17: 100 workflows on 30-resource homogeneous system with continuous task group allocation (CCR=1)
Figure 6.18: 100 workflows on 30-resource homogeneous system with continuous task group allocation (CCR=10)
Figure 6.19: 100 workflows on 30-resource homogeneous system for different distributed task group allocation methods (CCR=0.1)
Figure 6.20: 100 workflows on 30-resource homogeneous system for different distributed task group allocation methods (CCR=1)
Figure 6.21: 100 workflows on 30-resource homogeneous system for different distributed task group allocation methods (CCR=10)
Figure 6.22: 100 workflows on 30-resource heterogeneous system for continuous task group allocation (CCR=0.1)
Figure 6.23: 100 workflows on 30-resource heterogeneous system for continuous task group allocation (CCR=1)
Figure 6.24: 100 workflows on 30-resource heterogeneous system for continuous task group allocation (CCR=10)
Figure 6.25: 100 workflows on 30-resource heterogeneous system for distributed task group allocation (CCR=0.1)
Figure 6.26: 100 workflows on 30-resource heterogeneous system for distributed task group allocation (CCR=1)
Figure 6.27: 100 workflows on 30-resource heterogeneous system for distributed task group allocation (CCR=10)
Figure 6.28: Effect of different numbers of DAGs (CCR=10)

List of Tables

Table 3.1: Computational costs and different task ranking results for Figure 3.2
Table 3.2: Computational costs and different task ranking results for Figure 3.4
Table 3.3: Computation costs of each node in the workflow of Figure 3.6
Table 5.1: Class definition of Dag
Table 5.2: Class definition of Queuing
Table 5.3: Class definition of Simulation

Chapter 1. Introduction

Parallel task graph scheduling has long been an important research topic in the field of parallel processing and is well known to be a challenging NP-complete problem [24]. With the advancement of technology and the emergence of grid and cloud computing, many large-scale scientific and engineering applications are now constructed as workflows, which are structurally similar to traditional parallel task graphs, because they involve large amounts of interrelated computation and communication [22]. Many open-source workflow management systems, such as ASKALON [3], DAGman [6], Gridbus [14], and Pegasus [30], have been developed to support workflow applications in parallel and distributed systems. However, most of these systems simply enforce the execution dependencies defined in the workflow and do not provide workflow scheduling mechanisms.

Common workflows can usually be represented by Directed Acyclic Graphs (DAGs) [28], which describe the inter-task precedence constraints. Figure 1.1 shows an example of such a workflow. Each node represents a task that executes a specific program. The number next to each node is the required execution time of the task. The edges represent the dependences between tasks, and the number next to an edge is the inter-task data transmission time. A workflow scheduler has to schedule and allocate each task according to the dependences specified in the workflow definition.

Figure 1.1: General DAG-based workflow.

Due to the complexity of the problem, most previous workflow scheduling research focused on scheduling a single workflow on parallel systems [5] [11] [15] [16] [24] [26] [35]. However, as modern high-performance computing platforms such as grids and clouds become prevalent, many users run their workflow applications simultaneously. Scheduling multiple concurrent workflows efficiently has therefore become a crucial issue.

Many approaches have been proposed in the literature to deal with the challenging workflow scheduling problem [1] [7] [11] [15] [16] [26] [33] [35]. List scheduling and clustering are the two most common types of workflow scheduling heuristics [16]. In this thesis, we make one contribution to each of these two types of workflow scheduling. In the first contribution, we develop new task ranking and allocation methods for single workflow scheduling with list-based scheduling approaches. In the second contribution, we propose an efficient task group allocation method, considering both resource fitness and tasks' EFT (Earliest Finish Time), for concurrent workflow scheduling with clustering-based scheduling approaches. The proposed approaches were evaluated with a series of simulation experiments and compared to existing methods, namely HEFT [16], the lookahead variant of HEFT [5], the pure best-fit approach [13], the PCH approach [21] [22], and the distributed gap search approach [18]. The experimental results show that the approaches in the two contributions outperform existing methods significantly, achieving up to 11.8% and 15.5% performance improvement in terms of average makespan, respectively.

The remainder of this thesis is organized as follows. Chapter 2 discusses related work on workflow scheduling. Chapter 3 presents our task ranking and allocation methods for single workflow scheduling. Chapter 4 deals with the task group allocation issues in multiple workflow scheduling. The simulation environment for the experiments is described in Chapter 5. Chapter 6 presents the experiments and the results of the performance evaluation. Chapter 7 concludes this thesis.

Chapter 2. Related Work

Workflow scheduling algorithms are usually classified into three categories [16]: (1) list-based, (2) clustering-based, and (3) duplication-based.

A list-based heuristic maintains a list of all tasks of a workflow application ordered by priority and then schedules the tasks according to the list. Several list-based heuristics have been proposed in the literature [5] [11] [15] [16] [24] [26] [35]. One of the most famous list-based approaches is HEFT (Heterogeneous Earliest Finish Time), developed by Topcuoglu, Hariri, and Wu [16]. HEFT first computes the rank value of each task based on its computation and communication costs as well as its dependencies on other tasks. The tasks are then put into a queue in descending order of rank value. Finally, the scheduler allocates each task in the queue onto the processor that leads to the earliest finish time for the task. A lookahead variant of HEFT was proposed in [24]; it makes a task allocation decision by looking ahead in the schedule and taking into account the impact of this decision on the children of the task being allocated.

A best-fit allocation technique was proposed in [13] to deal with task allocation for multiple workflow scheduling based on a list-based approach. In that approach, the list scheduling heuristic is applied to allocate each individual task onto processors. During task allocation, several bin packing techniques, First Fit (FF), Best Fit (BF), and Worst Fit (WF), are used to search for appropriate idle time slots. Experimental results show that the Best Fit heuristic achieves the best performance [13]. In this thesis, we explore the issues of task ranking and allocation in list-based workflow scheduling and propose new approaches that improve on the performance of existing methods.
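The three bin packing policies mentioned above differ only in which feasible idle slot they select. The following is a minimal Python sketch of that difference, not the implementation from [13]: the slot representation `(resource, start, end)`, the function name `pick_slot`, the assumption that a task takes the same time on every resource, and the definition of "fit" as the idle time left in the slot after the task are all illustrative assumptions.

```python
def pick_slot(slots, duration, ready_time, policy="BF"):
    """Choose an idle time slot for one task (illustrative sketch).

    slots      -- list of (resource, start, end) idle intervals, in search order
    duration   -- execution time of the task (assumed equal on all resources)
    ready_time -- earliest time at which the task's input data is available
    Returns (resource, start_time), or None if no slot can hold the task.
    """
    feasible = []
    for resource, start, end in slots:
        begin = max(start, ready_time)            # cannot start before data is ready
        if begin + duration <= end:               # task must finish inside the slot
            leftover = (end - begin) - duration   # idle time remaining after the task
            feasible.append((leftover, resource, begin))
    if not feasible:
        return None
    if policy == "FF":                            # First Fit: first feasible slot found
        chosen = feasible[0]
    elif policy == "BF":                          # Best Fit: slot with least leftover
        chosen = min(feasible, key=lambda f: f[0])
    elif policy == "WF":                          # Worst Fit: slot with most leftover
        chosen = max(feasible, key=lambda f: f[0])
    else:
        raise ValueError(f"unknown policy: {policy}")
    return chosen[1], chosen[2]


# Hypothetical idle slots on three resources and a task of length 4.
slots = [("R1", 0, 10), ("R2", 2, 6), ("R3", 1, 20)]
results = {p: pick_slot(slots, duration=4, ready_time=0, policy=p)
           for p in ("FF", "BF", "WF")}
```

With these made-up slots, First Fit takes the first slot in search order, Best Fit takes the slot on R2 that the task fills exactly, and Worst Fit takes the longest slot on R3.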

The main idea of clustering-based heuristic methods [32] is to reduce communication delay by grouping tasks with heavy communication into a cluster. In general, a clustering-based heuristic has two phases: clustering and merging. In the clustering phase, the original workflow application is partitioned into clusters; the merging phase then merges clusters until the number of remaining clusters equals the number of resources. The Path Clustering Heuristic (PCH) [22] is a typical example of clustering-based heuristics. It first uses a clustering technique to generate groups of tasks based on the inter-task dependencies. Each group of tasks is then allocated onto a resource contiguously to minimize the inter-task communication costs. The key advantage of PCH [22] [23] is the reduced communication cost between tasks. During task group allocation, some idle time gaps form on resources because of the inter-task dependencies and the data communication costs between different resources.

Most proposed clustering-based heuristics [26] [32] [17] [19] [31] focus on different task clustering approaches and pay little attention to the allocation phase. Moreover, when scheduling multiple concurrent workflows, clustering-based approaches may produce task groups that are too large to fit into any idle time slot, which degrades the overall system performance. Jiang et al. proposed a distributed gap search scheme to remedy such situations [18]. In their approach, a task group is first broken down into a set of individual tasks, and then each task is allocated in a First Fit (FF) manner. In this thesis, we investigate the issue of task group allocation in multiple workflow scheduling using clustering-based approaches. We propose an adjustable task group allocation approach, which considers both resource fitness and tasks' EFT (Earliest Finish Time) and was shown to outperform existing approaches in the simulation experiments.
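To make the contrast between contiguous group allocation and distributed gap search concrete, here is a small Python sketch. It is a simplified illustration under several assumptions, not the scheme of Jiang et al. [18] or of PCH [22]: a task group is modeled as a chain of task execution times, communication costs are ignored, gaps are given as `(resource, start, end)` tuples in search order, and the names `allocate_continuous` and `allocate_distributed` are hypothetical.

```python
def allocate_continuous(durations, gaps):
    """Place the whole task group back to back inside one idle gap."""
    need = sum(durations)
    for resource, start, end in gaps:
        if end - start >= need:                  # first gap long enough for the group
            placements, t = [], start
            for d in durations:
                placements.append((resource, t))
                t += d
            return placements
    return None                                  # no single gap can hold the group


def allocate_distributed(durations, gaps):
    """Break the group into tasks and place each one in a First Fit manner."""
    gaps = [list(g) for g in gaps]               # mutable copies: [resource, start, end]
    placements, ready = [], 0
    for d in durations:
        for g in gaps:
            begin = max(g[1], ready)             # respect the chain order of the group
            if begin + d <= g[2]:
                placements.append((g[0], begin))
                g[1] = begin + d                 # the rest of this gap stays usable
                ready = begin + d
                break
        else:
            return None                          # this task fits in no gap
    return placements


# Hypothetical idle gaps and a two-task group with execution times 4 and 3.
gaps = [("R1", 0, 5), ("R2", 3, 9)]
group = [4, 3]
continuous = allocate_continuous(group, gaps)
distributed = allocate_distributed(group, gaps)
```

In this made-up case the group needs 7 time units but the gaps are only 5 and 6 units long, so contiguous allocation fails, while the distributed variant places the first task on R1 and the second on R2.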

A duplication-based heuristic method [12] tries to reduce the cost of transmitting a task's output data to the resources of its succeeding tasks by duplicating the task on the destination processors. Duplication-based heuristics have been shown to have the potential to achieve good performance when scheduling a single workflow [12]. However, they might not be appropriate for scheduling multiple concurrent workflows, since task duplication in one workflow consumes extra computation resources and thus degrades the performance of other workflows.

Chapter 3. Task Ranking and Allocation in List-Based Workflow Scheduling

List-based scheduling [5] [11] [15] [16] [24] [26] [35] is one of the most important workflow scheduling approaches. In list-based workflow scheduling, the entire scheduling process can be divided into two major steps: task ranking and task allocation. Task ranking first determines the priority of each task in a workflow; in the task allocation step, tasks are then allocated to resources for execution in order of task priority. In this chapter, we propose new task ranking and allocation methods for list-based scheduling and illustrate their effectiveness and superiority over the existing approach in [16].

The workflows discussed in this thesis can be represented by a Directed Acyclic Graph (DAG) [25], G(V, E), where:
V is the set of tasks, t_n ∈ V, |V| = number of tasks;
E is the set of directed edges, e_n ∈ E, |E| = number of edges.

The nodes without parents are called entry nodes, and the nodes without children are called exit nodes. Each node in the workflow is a task representing a specific job or program to execute, and each edge represents the data dependence between two nodes. A task starts its execution only after receiving all the required data from its parent nodes. Each node is associated with a weight representing the computation cost of the task, e.g. the required program execution time. Each edge is associated with a weight indicating the communication cost between the parent node and the child node connected by the edge, e.g. the required data transfer time between the two tasks. If both tasks are allocated onto the same resource, the communication cost between them is assumed to be zero.
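The workflow model described above can be captured by a very small data structure. The Python sketch below is only an illustration: the class name `Workflow` and its method names are assumptions made here, not the simulator classes of Chapter 5, and the example costs are made up.

```python
class Workflow:
    """Weighted DAG G(V, E) with node and edge weights (illustrative sketch)."""

    def __init__(self):
        self.comp = {}   # task -> computation cost (node weight)
        self.comm = {}   # (parent, child) -> communication cost (edge weight)

    def add_task(self, task, cost):
        self.comp[task] = cost

    def add_edge(self, parent, child, cost):
        self.comm[(parent, child)] = cost

    def parents(self, task):
        return [p for (p, c) in self.comm if c == task]

    def children(self, task):
        return [c for (p, c) in self.comm if p == task]

    def entry_nodes(self):
        # Entry nodes have no parents.
        return [t for t in self.comp if not self.parents(t)]

    def exit_nodes(self):
        # Exit nodes have no children.
        return [t for t in self.comp if not self.children(t)]

    def comm_cost(self, parent, child, res_parent, res_child):
        # Communication is assumed free when both tasks share a resource.
        return 0 if res_parent == res_child else self.comm[(parent, child)]


# A hypothetical four-task fork-join workflow: 1 -> {2, 3} -> 4.
wf = Workflow()
for task, cost in [(1, 10), (2, 5), (3, 8), (4, 4)]:
    wf.add_task(task, cost)
for parent, child, cost in [(1, 2, 3), (1, 3, 1), (2, 4, 2), (3, 4, 6)]:
    wf.add_edge(parent, child, cost)
```

The `comm_cost` method encodes the zero-cost rule for tasks on the same resource; everything else is plain bookkeeping over V and E.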

In this chapter, we investigate the effectiveness of task ranking and allocation methods on two kinds of workflows with different structures. The first kind has a general DAG structure, and the second kind has a more regular fork-join structure. As discussed in [14], DAGs with fork-join control structures are a common underlying structure for many workflow applications. Figure 3.1 shows an example of such a workflow structure. Languages and middleware such as BPEL [4] and Xavantes [10] have been developed for programming this kind of workflow application.

Figure 3.1: A workflow example of fork-join structure.

3.1 Task Ranking

HEFT [16] is one of the most famous list-based workflow scheduling approaches. Many later list-based approaches [5] [24] follow the task ranking and allocation mechanisms of HEFT. In HEFT [16], the priority of each task is calculated in a way similar to the bottom-level calculation in [29]. Several other possible task ranking methods were also mentioned in [16], but without further discussion or evaluation. In the following, we propose several alternative task ranking methods and illustrate that some of them have the potential to outperform the popular task ranking mechanism of HEFT [16].

Priority(Bottom):

\[
P_i^{\mathrm{bottom}} =
\begin{cases}
w_i, & \text{if task } i \text{ is an exit node} \\
w_i + \max_{j \in succ(n_i)} \left( c_{i,j} + P_j^{\mathrm{bottom}} \right), & \text{otherwise}
\end{cases}
\]

where $P_i$ indicates the priority of task $i$ (the superscript names the ranking method), $w_i$ is the computation cost of task $i$, $c_{i,j}$ represents the communication cost between tasks $i$ and $j$, and $succ(n_i)$ is the set of immediate children of task $i$. The bottom rank is the ranking mechanism adopted in HEFT [16].

Priority(Top):

\[
P_i^{\mathrm{top}} =
\begin{cases}
0, & \text{if task } i \text{ is an entry node} \\
\max_{j \in pre(n_i)} \left( P_j^{\mathrm{top}} + w_j + c_{j,i} \right), & \text{otherwise}
\end{cases}
\]

where $pre(n_i)$ is the set of immediate parents of task $i$.

Priority(Top+Bottom):

\[
P_i^{\mathrm{top+bottom}} = P_i^{\mathrm{top}} + P_i^{\mathrm{bottom}}
\]

Figure 3.2 is an example of a general-DAG workflow, with the red line indicating the critical path. The second to fourth columns of Table 3.1 show the computation costs of each task in Figure 3.2 on three different resources in a heterogeneous environment. The communication costs are shown next to the edges in Figure 3.2. Based on the computation and communication costs, the last two columns of Table 3.1 show the bottom rank, which is used in HEFT [16], and the top + bottom rank of each task, respectively.

Figure 3.2: An example of general structure workflow.

Table 3.1: Computational costs and different task ranking results for Figure 3.2

Task  Resource 1  Resource 2  Resource 3  Bottom Rank  Top+Bottom Rank
1     20          7           15          162          162
2     25          29          12          173          173
3     15          28          23          176          176
4     13          24          28          175          175
5     23          6           18          135          135
6     8           11          25          135          176
7     4           25          27          125          175
8     5           7           29          122          173
9     4           11          5           104          135
10    22          8           1           89           173
11    25          30          27          97           173
12    8           27          7           100          176
13    14          15          22          61           176
14    20          2           18          13           157
15    30          21          10          20           169
16    24          12          30          22           176
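The bottom, top, and top + bottom ranks can be computed with two short recursions over the DAG. The Python sketch below is a hedged illustration rather than the thesis code: it assumes the bottom rank is the HEFT upward rank, that the top rank of an entry node is zero, and that w_i is the mean computation cost over all resources, as in HEFT. The exit-node ranks in Table 3.1 look consistent with the last assumption (task 16: (24 + 12 + 30) / 3 = 22). The function name `compute_ranks` and the example workflow are made up.

```python
from functools import lru_cache


def compute_ranks(cost, comm):
    """Return task -> (bottom, top, top_plus_bottom).

    cost -- task -> list of computation costs, one entry per resource
    comm -- (parent, child) -> communication cost of that edge
    """
    # Assumption: w_i is the mean cost over all resources, as in HEFT.
    w = {t: sum(v) / len(v) for t, v in cost.items()}
    succ = {t: [] for t in cost}
    pred = {t: [] for t in cost}
    for parent, child in comm:
        succ[parent].append(child)
        pred[child].append(parent)

    @lru_cache(maxsize=None)
    def bottom(i):
        # Own cost plus the most expensive path down to an exit node.
        return w[i] + max((comm[(i, j)] + bottom(j) for j in succ[i]), default=0)

    @lru_cache(maxsize=None)
    def top(i):
        # Most expensive path from an entry node up to (excluding) task i.
        return max((top(j) + w[j] + comm[(j, i)] for j in pred[i]), default=0)

    return {t: (bottom(t), top(t), top(t) + bottom(t)) for t in cost}


# Hypothetical four-task workflow 1 -> {2, 3} -> 4 on two resources.
cost = {1: [8, 12], 2: [4, 6], 3: [7, 9], 4: [3, 5]}
comm = {(1, 2): 3, (1, 3): 1, (2, 4): 2, (3, 4): 6}
ranks = compute_ranks(cost, comm)
```

In this small example, tasks 1, 3, and 4 share the largest top + bottom value, which equals the length of the critical path 1 → 3 → 4, while task 2 gets a smaller value.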

One important feature of the top + bottom ranking mechanism is that it gives the highest priority to the tasks on the critical path. The critical path {3, 6, 12, 13, 16} is drawn as a red line in Figure 3.2 and highlighted in red in Table 3.1. A task becomes ready once all its parents have been allocated. Among the set of ready tasks, the one with the highest priority is selected for the next allocation. Therefore, different task ranking mechanisms lead to different task allocation sequences. For example, according to the bottom rank, the task allocation sequence for the workflow in Figure 3.2 is <3, 4, 2, 1, 6, 5, 7, 8, 9, 12, 11, 10, 13, 16, 15, 14>. According to the top + bottom rank, on the other hand, the task allocation sequence becomes <3, 6, 4, 2, 8, 10, 1, 7, 12, 5, 9, 11, 13, 16, 15, 14>. Figure 3.3 shows the resultant schedules for the two task allocation sequences. The task allocation method used is the Earliest Finish Time (EFT) principle of [16]. The scheduling results in Figure 3.3 indicate that the makespan, i.e. the schedule length, based on the bottom rank is 136, while the makespan based on the top + bottom rank is 132. This is because the top + bottom ranking tends to allocate the tasks on the critical path earlier, as shown in Figure 3.3, leading to a shorter makespan.

Figure 3.3: The resultant schedules according to the two different task ranking methods: (a) bottom ranking (b) top + bottom ranking

However, we found that the situation is quite different for workflows with a fork-join structure. Figure 3.4 is an example of such a fork-join workflow. Table 3.2 shows the computation costs of each task in the workflow of Figure 3.4 and three different task ranking results. The task allocation sequences according to the bottom rank and the top + bottom rank are <1, 2, 3, 10, 4, 18, 15, 12, 7, 11, 6, 14, 5, 8, 13, 16, 19, 20, 17, 9, 21, 22, 23> and <1, 2, 3, 4, 5, 10, 11, 15, 16, 12, 13, 14, 17, 7, 8, 6, 9, 18, 19, 20, 21, 22, 23>, respectively. Figure 3.5 (a) and (b) show the resultant schedules for the two task allocation sequences. In contrast to the general workflow example in Figure 3.3, the top + bottom ranking mechanism leads to a worse makespan than the bottom rank approach for the fork-join workflow.

Careful investigation of Figure 3.5 revealed that the fork-join structure is the root cause of these results. For the fork-join workflow in Figure 3.4, the critical path, indicated by the red line, is {1, 2, 3, 4, 5, 9, 22, 23}. Therefore, the top + bottom ranking mechanism gives nodes 3, 4, and 5 a higher priority than node 10. However, the number of successors of node 10 in the sub-branch between node 10 and node 17 is larger than the number of successors of node 3 in the sub-branch between node 3 and node 9. Giving nodes 3, 4, and 5 a higher priority therefore delays the start time of node 10, as shown in Figure 3.5, which delays a larger number of successors and results in a longer schedule.

Figure 3.4: Example of fork-join structure workflow

Table 3.2: Computational costs and different task ranking results for Figure 3.4

Task  Resource 1  Resource 2  Resource 3  Bottom Rank  Top+Bottom Rank  Bottom Amount Rank
1     20          7           15          292          292              1558
2     29          13          15          253          292              1519
3     23          13          24          206          292              506
4     23          6           18          158          292              158
5     8           11          25          135          292              135
6     25          27          5           142          252              142
7     29          4           11          147          260              147
8     22          8           1           128          260              128
9     30          27          8           96           292              96
10    21          10          24          188          276              667
11    12          30          23          146          276              146
12    29          19          7           152          267              152
13    16          21          7           128          267              128
14    12          1           19          136          262              136
15    25          22          12          153          268              153
16    10          27          3           117          268              117
17    10          9           7           100          276              100
18    15          22          20          158          223              262
19    14          19          27          114          223              114
20    24          13          14          103          188              103
21    19          13          27          81           223              81
22    29          14          9           61           292              61
23    21          7           29          19           292              19

Based on the above observation, we propose a new task ranking method, called bottom amount, for workflows with a fork-join structure.

Priority(Bottom amount):

\[
P_i^{\mathrm{amount}} =
\begin{cases}
w_i, & \text{if task } i \text{ is an exit node} \\
w_i + \sum_{j \in succ(n_i)} \left( c_{i,j} + P_j^{\mathrm{amount}} \right), & \text{otherwise}
\end{cases}
\]

In contrast to the bottom rank, where only the child with the maximum value of rank plus communication cost is counted, the proposed bottom amount method calculates the rank of a task by adding up its computation cost and, for every immediate child, the child's rank plus the communication cost from the task to that child. The rank of each task can therefore appropriately represent the amount of workload that depends on it. The last column of Table 3.2 shows the rank of each task according to the bottom amount approach. The task allocation sequence based on the bottom amount rank is thus <1, 2, 10, 3, 18, 4, 15, 12, 7, 11, 6, 14, 5, 8, 13, 16, 19, 20, 17, 9, 21, 22, 23>. Figure 3.5 (c) shows the resultant schedule based on this sequence and indicates that, for fork-join workflows, the bottom amount ranking approach can achieve a better makespan than both the top + bottom rank and the bottom rank used in HEFT [16].

Figure 3.5: The resultant schedules according to the three different task ranking methods: (a) bottom ranking (b) top + bottom ranking (c) bottom amount.
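The bottom amount rank and the ready-list selection rule used throughout this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the thesis implementation: `w` holds one (already averaged) computation cost per task, exit nodes get their own cost as rank, ties between equally ranked ready tasks are broken by the smaller task id, and the function names and the example workflow are made up.

```python
import heapq


def bottom_amount_ranks(w, comm):
    """Rank = own cost + sum over ALL children of (edge cost + child's rank)."""
    succ = {t: [] for t in w}
    for parent, child in comm:
        succ[parent].append(child)
    memo = {}

    def rank(i):
        if i not in memo:
            memo[i] = w[i] + sum(comm[(i, j)] + rank(j) for j in succ[i])
        return memo[i]

    return {t: rank(t) for t in w}


def allocation_sequence(priority, comm):
    """Repeatedly pick the ready task (all parents allocated) of highest priority."""
    children = {t: [] for t in priority}
    waiting = {t: 0 for t in priority}           # number of unallocated parents
    for parent, child in comm:
        children[parent].append(child)
        waiting[child] += 1
    ready = [(-priority[t], t) for t in priority if waiting[t] == 0]
    heapq.heapify(ready)                         # max-priority via negated keys
    order = []
    while ready:
        _, task = heapq.heappop(ready)
        order.append(task)
        for child in children[task]:
            waiting[child] -= 1
            if waiting[child] == 0:              # child has just become ready
                heapq.heappush(ready, (-priority[child], child))
    return order


# Hypothetical four-task fork-join workflow 1 -> {2, 3} -> 4.
w = {1: 10, 2: 5, 3: 8, 4: 4}
comm = {(1, 2): 3, (1, 3): 1, (2, 4): 2, (3, 4): 6}
amount = bottom_amount_ranks(w, comm)
order = allocation_sequence(amount, comm)
```

Note that a descendant reachable through several children is counted once per path, which is why fork nodes receive much larger values than under the bottom rank (compare tasks 1, 2, 3, and 10 in Table 3.2).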

3.2 Task Allocation

This section explores the issue of task allocation in list-based workflow scheduling. In most list-based approaches, including HEFT [16], task allocation follows a simple Earliest Finish Time (EFT) principle: each task is allocated to the resource that leads to its earliest finish time, considering both its own computation cost and the communication costs from all its parents. The EFT principle is straightforward and reasonable, and is therefore widely adopted. However, we evaluated another simple task allocation method, called the Fast (FST) principle, which allocates each task onto the resource with the minimum computation cost for that task without considering any communication costs, and surprisingly found that it outperforms the EFT principle in many cases.

Figure 3.6 illustrates such a case with a fork-join workflow. The red line in Figure 3.6 (a) indicates the critical path. The task allocation sequence based on the HEFT ranking mechanism is <1, 2, 17, 19, 18, 20, 3, 7, 4, 5, 6, 8, 9, 15, 12, 13, 14, 10, 11, 16, 21, 22>. The schedules that result from applying the EFT and FST principles to this sequence are shown in Figure 3.6 (b) and (c), respectively. The results indicate that FST leads to a makespan of 173, much better than the 202 achieved by EFT. Consider the allocation of the first seven nodes in the sequence, {1, 2, 17, 19, 18, 20, 3}. In the case of EFT, nodes 17, 19, 18, and 20 are all allocated on resource R2, since this arrangement leads to their earliest finish times by avoiding inter-task communication costs. However, this arrangement delays the start time of node 3 and its successors, including nodes 8, 21, and 22, which are on the critical path. The delay of tasks on the critical path degrades the overall performance, giving a makespan of 202.

(26) On the other hand, in the case of FST, as shown in Figure 3.6 (c), nodes 17 and 18 are allocated to the fastest resources for them, R1 and R3, respectively. This arrangement leaves space on R2 for node 3 to start earlier. The earlier start time of node 3 in turn improves the start time of nodes 8, 21, and 22, compared to Figure 3.6 (b), leading to a shorter critical path execution time and an improved overall makespan of 173. Compared to EFT [16], the proposed FST principle not only has the potential to achieve better workflow execution performance, but also has a lower computational complexity, since it does not consider inter-task communication costs when making task allocation decisions.

(27) Figure 3.6: (a) Example workflow (b) schedule by EFT (c) schedule by FST

Table 3.3: Computation costs of each node in the workflow of Figure 3.6

Task    Resource 1    Resource 2    Resource 3
1       20            7             15
2       29            13            15
3       23            13            24
4       23            6             18
5       8             11            25
6       25            27            5
7       29            4             11
8       22            8             1
9       18            30            21
10      24            13            30
11      9             29            19
12      5             6             16
13      7             20            12
14      1             18            9
15      22            12            17
16      27            3             25
17      4             7             15
18      20            25            14
19      27            1             24
20      14            13            19
21      15            22            29
22      27            29            21
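The difference between the two allocation principles can be illustrated with a minimal Python sketch. The computation costs are those of Table 3.3. The function names `fst_choice` and `eft_choice`, the non-insertion resource model (each resource is described only by the time at which it becomes free), and the parent finish time and communication cost in the final call are assumptions made for illustration, since the edge weights of Figure 3.6 (a) are not reproduced in the text.

```python
# Computation costs from Table 3.3; COSTS[task] = (cost on R1, R2, R3).
R1 = [20, 29, 23, 23, 8, 25, 29, 22, 18, 24, 9, 5, 7, 1, 22, 27, 4, 20, 27, 14, 15, 27]
R2 = [7, 13, 13, 6, 11, 27, 4, 8, 30, 13, 29, 6, 20, 18, 12, 3, 7, 25, 1, 13, 22, 29]
R3 = [15, 15, 24, 18, 25, 5, 11, 1, 21, 30, 19, 16, 12, 9, 17, 25, 15, 14, 24, 19, 29, 21]
COSTS = {task: costs for task, costs in enumerate(zip(R1, R2, R3), start=1)}


def fst_choice(costs):
    """FST: index of the resource with the minimum computation cost.

    Communication costs and resource availability are ignored.
    """
    return min(range(len(costs)), key=lambda r: costs[r])


def eft_choice(costs, resource_free_at, parents):
    """EFT: (resource index, finish time) giving the earliest finish time.

    resource_free_at[r] is the time at which resource r becomes free
    (assumed non-insertion model). parents is a list of
    (finish_time, resource_index, comm_cost); the communication cost is
    paid only when parent and child run on different resources.
    """
    best = None
    for r, cost in enumerate(costs):
        data_ready = max(
            (finish + (0 if res == r else comm) for finish, res, comm in parents),
            default=0,
        )
        finish_time = max(resource_free_at[r], data_ready) + cost
        if best is None or finish_time < best[1]:
            best = (r, finish_time)
    return best


# FST sends nodes 17 and 18 to R1 and R3, as described for Figure 3.6 (c).
print(fst_choice(COSTS[17]) + 1, fst_choice(COSTS[18]) + 1)  # 1 3

# Hypothetical situation: the parent of node 17 finished at time 20 on R2
# (index 1) and sending its data elsewhere costs 10 time units.
print(eft_choice(COSTS[17], [0, 20, 0], [(20, 1, 10)]))  # (1, 27)
```

In the hypothetical call, EFT keeps node 17 on R2 to avoid the communication cost, whereas FST moves it to R1, which is the behaviour contrasted in Figure 3.6 (b) and (c).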

(28) In summary, this chapter proposes two alternative task ranking methods, the top + bottom rank for general workflows and the bottom amount rank for fork-join workflows, and one new task allocation principle, FST. According to the illustrative examples, the proposed methods have the potential to outperform HEFT [16], one of the best-known list-based scheduling approaches.

(29) Chapter 4. Task Group Allocation in Clustering-Based Multiple Workflow Scheduling

Clustering-based methods are one of the major categories of workflow scheduling approaches [19][26][29][31][32]. In general, clustering-based workflow scheduling can be divided into three major steps:

1. The first step clusters the tasks in a workflow into several task groups in order to minimize inter-task communication costs.
2. The second step puts the task groups into the ready queue for allocation according to the priority of each task group. A typical prioritizing mechanism is the Earliest Start Time (EST) of each task group, calculated based on the static information on the task graph.
3. The third step allocates each task group in the ready queue onto an appropriate resource using a specific task group allocation mechanism.

The clustering technique in the first step focuses on reducing the communication costs between tasks. Figure 4.1 is an example of the typical task clustering technique used in PCH [21][22]. In this example, the clustering process generates four task groups: {1, 2, 3, 4}, {5, 6, 7, 8, 9, 10}, {11, 12}, and {13, 14, 15}, as shown with different colors in Figure 4.1.

(30) Figure 4.1: A task clustering example

Previous research showed that clustering-based approaches have the potential to outperform list-based scheduling methods [22]. Figure 4.2 shows such an example. In this example, two concurrent workflows are to be scheduled, as shown in Figure 4.2 (a). Figure 4.2 (b) is the scheduling result of a typical list-based method. Figure 4.2 (c), on the other hand, illustrates the result of clustering-based scheduling, where the tasks in the two workflows are clustered into six task groups, {A, C}, {B, E}, {D}, {1, 2}, {3}, {4}, for allocation. Because inter-task communication costs within the same task group become zero, clustering-based scheduling leads to a shorter makespan than the list-based method.

Most previous workflow scheduling research focused on scheduling a single workflow on parallel systems. However, as modern high-performance computing platforms, such as grid and cloud, become prevalent, many users might need to run their workflow applications simultaneously. Therefore, scheduling multiple concurrent workflows efficiently becomes a crucial issue. In this chapter, we explore the issues of task group allocation in multiple workflow scheduling based on clustering-based methods.

(31) Figure 4.2: (a) An example of two concurrent workflows; (b) list-based scheduling result; (c) clustering-based scheduling result.

4.1 Task Group Allocation for Continuous Gap Search

Most proposed clustering-based workflow scheduling approaches focused on how to cluster the nodes in a workflow into different task groups [22]. Although these task groups have to be allocated onto computing resources for execution after the clustering phase, few studies discuss the task allocation issue. When workflows are scheduled onto computing resources, inter-task dependency and data communication costs cause idle time slots to form between scheduled tasks on each resource. In [13], Stavrinides and Karatza proposed an approach that utilizes these idle time slots efficiently through bin packing techniques. In their

(32) approach, the list scheduling heuristic is applied to allocate each individual task onto processors. During task allocation, several bin packing techniques, First Fit (FF), Best Fit (BF), and Worst Fit (WF), are used to search for appropriate idle time slots. Experimental results show that the Best-Fit heuristic achieves the best performance [13]. Best-Fit allocation has the potential to improve resource utilization. However, it might delay tasks' start time and in turn degrade the performance of the entire workflow, because it skips some earlier available time slots to find the fittest one. Therefore, in this section, we propose an adjustable task group allocation approach to further improve multiple-workflow scheduling performance through a compromise between the task group's finish time and the fitness of an idle time slot.

Figure 4.3 is an example of best-fit task group allocation. In Figure 4.3, three workflows are to be scheduled at the same time. In the first step, the clustering technique is used to group the tasks, resulting in four task groups for the upper workflow, two task groups for the lower left workflow, and two task groups for the lower right workflow. Different task groups are marked with different colors in Figure 4.3. Then, the task groups in the upper workflow are scheduled first. When scheduling the lower two workflows, the approach first calculates the computation cost of each task group, e.g. {A, B, C, D}, by summing up the computation costs of all the tasks within the task group. Then, the best-fit technique, as in [13], is used to find the fittest idle time slot for allocating each task group.
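The three bin packing heuristics mentioned above can be sketched in a few lines of Python. This is only an illustration of the general First Fit, Best Fit, and Worst Fit rules applied to idle time slots, not the implementation used in [13]; the representation of a slot as a `(resource, start, end)` tuple and the example values are assumptions.

```python
def fitting_slots(slots, size):
    """Idle slots (resource, start, end) long enough to hold `size`."""
    return [s for s in slots if s[2] - s[1] >= size]


def first_fit(slots, size):
    """First Fit: the earliest-starting slot that is large enough."""
    candidates = fitting_slots(slots, size)
    return min(candidates, key=lambda s: s[1]) if candidates else None


def best_fit(slots, size):
    """Best Fit: the slot that leaves the least unused time."""
    candidates = fitting_slots(slots, size)
    return min(candidates, key=lambda s: (s[2] - s[1]) - size) if candidates else None


def worst_fit(slots, size):
    """Worst Fit: the slot that leaves the most unused time."""
    candidates = fitting_slots(slots, size)
    return max(candidates, key=lambda s: (s[2] - s[1]) - size) if candidates else None


# Hypothetical idle slots on two resources and a task group of size 5.
slots = [(1, 0, 12), (2, 20, 26), (1, 30, 50)]
print(first_fit(slots, 5))  # (1, 0, 12)
print(best_fit(slots, 5))   # (2, 20, 26)
print(worst_fit(slots, 5))  # (1, 30, 50)
```

In this example Best Fit skips the slot starting at time 0 in favour of the tighter slot starting at time 20, which is exactly the delayed start time discussed above.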

(33) Figure 4.3: Example of clustering-based best-fit allocation

Although best-fit allocation in general can raise resource utilization, benefiting subsequent workflows, it might delay tasks' start time and in turn degrade the performance of the current workflow, because it skips some earlier available time slots to find the fittest one. To overcome this drawback, we first proposed an enhanced best-fit task group allocation mechanism [37], which tries to strike a balance between tasks' finish

(34) time and the fitness of idle time slots when allocating task groups. In our enhanced best-fit allocation, an enhanced fitness value is calculated for each idle time slot that is large enough to accommodate the task group to be allocated. The fitness value of a time slot is calculated by summing up the finish time of the task group, if allocated on the time slot, and the difference between the lengths of the time slot and the task group. The time slot with the smallest fitness value is chosen to allocate the task group.

Figure 4.4 is an example illustrating the enhanced best-fit technique. The three workflows in Figure 4.4 have the same fork-join structures as in the previous figure, but some nodes and edges have different weights. Figure 4.4 (b) is the schedule produced by the original best-fit approach illustrated in Figure 4.3, and Figure 4.4 (c) is the result generated by the enhanced best-fit allocation. Figure 4.4 shows that the enhanced approach improves the system performance in that the makespans of two workflows are shortened, from 103 to 52 and from 117 to 116, respectively, while the performance of the other one remains the same. Moreover, the average makespan of all three workflows is reduced from 110.6 to 93, as shown in Figure 4.4 (d).

(35) Figure 4.4: Example of enhanced best-fit task group allocation. Panel (d) reports the average makespan: 110.6 for schedule (b) and 93 for schedule (c).

Based on the above enhanced best-fit allocation mechanism, we propose an adjustable task group allocation approach, since the interaction between the two effects of time slot fitness and the task group's EFT is complicated and is affected by the workload and the system environment. The allocation decision for each task group is made by evaluating each possible time slot between any two allocated task groups on every resource according to the following score calculation formula. The time slot with the lowest score is selected to allocate the task group.

(36) …….. (1)

In formula (1), σ is an adjustable parameter ranging between 0 and 1, which is used to adjust the relative weights of the two effects of time slot fitness and the task group's EFT. f is the evaluation of the time slot fitness, calculated by subtracting the required computation time of the entire task group tx from the length of the candidate time slot. The function EFT( ) calculates the earliest finish time of tx if allocated on the candidate time slot.

Figure 4.5 provides an algorithmic description of the adjustable task group allocation approach. The algorithm evaluates each idle time gap in the system in turn, as described at lines 1 and 2. Lines 3 to 11 first calculate the score of the current gap according to formula (1) above, and then check whether it is the lowest-score gap found so far. If no gap can accommodate the task group, or if the finish time in the selected gap is later than the finish time achievable at the end of the current schedule on some resource, the task group is allocated to the end of the current schedule on the resource that allows it to finish at the earliest time. This is described at lines 14 to 20.

Algorithm: Adjustable Best-Fit Task Group Allocation
Input:
T_r: total number of resources
n_i: total number of gaps on resource i
σ: adjustable parameter ranging between 0 and 1
gap_i(j): size of the jth gap on resource i
gap_i(j).start: the start time of the jth gap on resource i
gap_i(j).end: the end time of the jth gap on resource i
size_t: size (total computation cost) of the task group t
task_gap_i(j).end: the expected finish time of the task group if allocated onto the jth gap on resource i
Variables:

(37) min: the lowest score found so far, initialized as ∞
final_i.end: the expected finish time of the task group if allocated onto the last task's finish time on resource i
final_i: the infinite gap starting at the last task's finish time on resource i
i: index of resource
j: index of gap on resource i
Output:
found_gap: the index of the gap for allocation
found_gap.end: the finish time of the task group if allocated onto the found gap, initialized as 0
found_res: the index of the resource on which the gap is found

1: for i = 1 to T_r do
2:   for j = 1 to n_i do
3:     if (gap_i(j) ≥ size_t and task_gap_i(j).end ≤ gap_i(j).end) then
4:       tempmin = score calculated according to formula (1) with σ
5:       if (min > tempmin) then
6:         min = tempmin
7:         found_gap = j
8:         found_res = i
9:         found_gap.end = task_gap_i(j).end
10:      end if
11:    end if
12:  end for loop
13: end for loop
14: for i = 1 to T_r do
15:   if (found_gap.end > final_i.end) then
16:     found_gap.end = final_i.end
17:     found_gap = final_i
18:     found_res = i
19:   end if
20: end for loop

Figure 4.5: Algorithmic description of adjustable best-fit task group allocation.
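The procedure of Figure 4.5 can be sketched in Python as follows. Because formula (1) is not legible in this copy, the score used here, σ·f + (1 − σ)·EFT, is an assumption chosen only to match the verbal description (σ weighs slot fitness against finish time, and a lower score is better); the actual formula may differ. The function name, the `ready_time` parameter, and the example values are likewise hypothetical, and the best finish time starts at infinity so that the fallback of lines 14 to 20 also applies when no gap fits.

```python
import math


def adjustable_best_fit(gaps, last_finish, size, sigma, ready_time=0):
    """Choose where to place a task group of total cost `size`.

    gaps[i] is a list of (start, end) idle gaps on resource i, and
    last_finish[i] is the time at which the current schedule on
    resource i ends. Returns (resource, gap index or None, finish time);
    a gap index of None means "append at the end of the schedule".
    """
    best_score = math.inf
    choice = (None, None, math.inf)
    # Lines 1-13: evaluate every gap that can hold the whole task group.
    for i, resource_gaps in enumerate(gaps):
        for j, (start, end) in enumerate(resource_gaps):
            finish = max(start, ready_time) + size
            if end - start >= size and finish <= end:
                fitness = (end - start) - size
                # ASSUMED form of formula (1); lower is better.
                score = sigma * fitness + (1 - sigma) * finish
                if score < best_score:
                    best_score = score
                    choice = (i, j, finish)
    # Lines 14-20: fall back to the end of a schedule if that finishes earlier.
    for i, free_at in enumerate(last_finish):
        finish = max(free_at, ready_time) + size
        if choice[2] > finish:
            choice = (i, None, finish)
    return choice


# Hypothetical system: one wide early gap on resource 0, one tight later gap
# on resource 1.
gaps = [[(0, 30)], [(10, 16)]]
last_finish = [60, 50]
print(adjustable_best_fit(gaps, last_finish, 5, sigma=1.0))   # (1, 0, 15)
print(adjustable_best_fit(gaps, last_finish, 5, sigma=0.0))   # (0, 0, 5)
print(adjustable_best_fit(gaps, last_finish, 40, sigma=0.5))  # (1, None, 90)
```

Under this assumed form, σ = 1 reduces to pure best-fit and σ = 0 to pure earliest finish time, while σ = 0.5 ranks gaps in the same order as the sum of finish time and slot slack used by the enhanced best-fit allocation.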

(38) 4.2 Task Group Allocation for Distributed Gap Search

The task group allocation algorithms discussed in the previous section try to allocate an entire task group into a single gap, i.e. an idle time slot, on a specific resource. However, clustering-based approaches might sometimes produce task groups too large to fit into any idle time slot. If this happens, it results in both degraded resource utilization and delayed task completion time. Jiang et al. proposed a distributed gap search scheme [18] to resolve this problem, which allows the tasks of the same group to be allocated into different gaps on different resources. This distributed scheme was shown to have the potential to further improve resource utilization, leading to better workflow execution performance in terms of makespan. However, in the distributed gap search scheme proposed in [18], each task group is first cut into individual tasks before allocation, and each task is then allocated into an appropriate gap based on the First Fit (FF) principle, as shown in Figure 4.6.

Figure 4.6: The distributed gap search scheme by Jiang et al. [18]

The above distributed gap search approach proposed in [18] might compromise the potential advantage of clustering-based workflow scheduling, since each task group is

(39) decomposed into individual tasks before allocation, resulting in some unnecessary inter-task communication overheads. In this section, we propose an adaptive distributed gap search scheme which aims to minimize unnecessary inter-task communication costs while keeping the advantage of distributed gap search. In our approach, each task group is cut into several subgroups during the allocation process, in contrast to being decomposed into individual tasks at the beginning as in the original distributed gap search scheme [18]. At each decomposition, an original task group is cut into two new subgroups. The first subgroup contains the largest number of tasks that can be fitted into the gap under consideration, and the other subgroup consists of the remaining tasks. Since each subgroup contains as many tasks as possible, the inter-task communication costs can be minimized. However, the adaptive distributed gap search approach requires a more complicated gap evaluation procedure, since different gaps in the schedule cause a task group to be decomposed into subgroups of different sizes, which might lead to different degrees of fitness to the gaps and different finish times of the original task group, as shown in Figure 4.7.

Figure 4.7: Gap evaluation in adaptive distributed gap search
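A single decomposition step, as described above, can be sketched as follows. The sketch assumes that the tasks of a group are kept in their execution order and that the first subgroup is a prefix of that order; the function name `split_for_gap` and the example costs are hypothetical.

```python
def split_for_gap(task_costs, gap_length):
    """Cut an ordered task group into (first_subgroup, remaining_tasks).

    The first subgroup is the longest prefix of the group whose total
    computation cost still fits into a gap of length `gap_length`
    (assumption: subgroups are prefixes of the group's task order).
    """
    used = 0
    k = 0
    for cost in task_costs:
        if used + cost > gap_length:
            break
        used += cost
        k += 1
    return task_costs[:k], task_costs[k:]


group = [4, 6, 5, 3]  # hypothetical computation costs of one task group
print(split_for_gap(group, 12))  # ([4, 6], [5, 3])
print(split_for_gap(group, 20))  # ([4, 6, 5, 3], [])
print(split_for_gap(group, 3))   # ([], [4, 6, 5, 3])
```

Different gap lengths yield first subgroups of different sizes, which is why each gap has to be evaluated separately before a decomposition point is chosen.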

(40) The score formula (1) in Section 4.1 cannot deal with the complicated gap evaluation issue in the proposed adaptive distributed gap search approach, since a gap might not be able to accommodate the entire original task group, in which case the EFT of the entire task group is not available. To overcome this difficulty, we define a new measure, called the task group completion ratio, as follows.

$$\text{task group completion ratio}(G_x) = \frac{\text{size of the first subgroup}}{\text{size of the original task group}} \qquad (2)$$

where the denominator is the total computation cost of the original task group before decomposition and the numerator is the accumulated computation cost of the first subgroup after decomposition. Based on the new metric in formula (2), each gap is now evaluated with a new score function, defined in formula (3), in the adaptive distributed gap search approach.

… (3)

In formula (3), f is an evaluation of gap fitness, calculated by subtracting the required computation time of the first subgroup after decomposition from the length of the candidate gap. EFT(tx) is the Earliest Finish Time (EFT) of the first subgroup if allocated to the gap under consideration. σ is an adjustable parameter ranging between 0 and 1, which is used to adjust the relative weights of the two effects of gap fitness and the first subgroup's EFT. Since the entire score function in formula (3) is based on the smaller-is-better principle, the task group completion ratio is put in the denominator of the last term.

Figure 4.8 provides an algorithmic description of the adaptive distributed task group allocation approach. The while loop beginning at line 1 is executed until all the tasks in the task group are allocated. The algorithm evaluates each idle time gap in the system in turn, as described at lines 2 and 3. Lines 4 to 12 deal with the case that the gap can

(41) accommodate the entire task group. Lines 13 to 24 handle the case in which the current gap is not large enough for the task group, by cutting the task group into two subgroups and allocating the first subgroup first.

Algorithm: Adaptive Distributed Task Group Allocation
Input:
T_r: total number of resources
n_t: the number of tasks in the task group
n_i: total number of gaps on resource i
σ: adjustable parameter ranging between 0 and 1
gap_i(j): size of the jth gap on resource i
gap_i(j).start: the start time of the jth gap on resource i
gap_i(j).end: the end time of the jth gap on resource i
size_t: size (total computation cost) of the task group t
task_gap_i(j).end: the expected finish time of the task group if allocated onto the jth gap on resource i
Variables:
min: the lowest score found so far, initialized as ∞
final_i.end: the expected finish time of the task group if allocated onto the last task's finish time on resource i
final_i: the infinite gap starting at the last task's finish time on resource i
i: index of resource
j: index of gap on resource i
k: index of task in the task group
Output:
found_gap: the index of the gap for allocation
found_gap.end: the finish time of the task group if allocated onto the found gap, initialized as 0
found_res: the index of the resource on which the gap is found

1. While (some of the task group's tasks are not yet scheduled)
2.   for i = 1 to T_r do
3.     for j = 1 to n_i do
4.       if (gap_i(j) ≥ size_t and task_gap_i(j).end ≤ gap_i(j).end) then

(42)
5.         tempmin = score calculated according to formula (3) with σ, assuming the task group completion ratio to be one
6.         if (min > tempmin) then
7.           min = tempmin
8.           found_gap = j
9.           found_res = i
10.          found_gap.end = task_gap_i(j).end
11.          k = n_t
12.        end if
13.      else
14.        according to gap_i(j), decompose the task group into two subgroups and try to allocate the first subgroup into the jth gap on resource i (gap_i(j))
15.        k = the index of the last task in the first subgroup after decomposition
16.        calculate task group completion ratio according to formula (2)
17.        tempmin = score calculated according to formula (3) with σ
18.        if (min > tempmin) then
19.          min = tempmin
20.          found_gap = j
21.          found_res = i
22.          found_gap.end = task_gap_i(j).end
23.        end if
24.      end if
25.    end for loop
26.  end for loop
27.  return k // indicating the decomposition point of the original task group
28. end While loop

Figure 4.8: Algorithmic description of adaptive distributed task group allocation.
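The per-gap evaluation inside the loop of Figure 4.8 can be sketched in Python as follows. The completion ratio follows the verbal definition of formula (2). Formula (3) is not legible in this copy, so `assumed_score` is only one possible form that is consistent with the description (σ weighs gap fitness against the first subgroup's EFT, and the completion ratio appears in the denominator of the last term); it should be read as an assumption. The function names and example values are hypothetical, and the first subgroup is assumed to be a prefix of the group's task order.

```python
import math


def assumed_score(fitness, eft, ratio, sigma):
    """ASSUMED stand-in for formula (3); lower is better."""
    return sigma * fitness + (1 - sigma) * eft / ratio


def evaluate_gap(task_costs, gap_start, gap_end, sigma):
    """Score one gap for an ordered task group.

    Returns (score, k), where k is the number of tasks in the first
    subgroup, or None if not even the first task fits into the gap.
    """
    length = gap_end - gap_start
    used = 0
    k = 0
    for cost in task_costs:
        if used + cost > length:
            break
        used += cost
        k += 1
    if k == 0:
        return None
    ratio = used / sum(task_costs)  # completion ratio as in formula (2); 1.0 if all fit
    fitness = length - used         # unused time left in the gap
    eft = gap_start + used          # finish time of the first subgroup
    return assumed_score(fitness, eft, ratio, sigma), k


group = [4, 6, 5, 3]                            # hypothetical task costs
partial = evaluate_gap(group, 0, 12, sigma=0.5)  # only the first two tasks fit
whole = evaluate_gap(group, 20, 40, sigma=0.5)   # the entire group fits
print(partial[1], whole[1])                      # 2 4
print(partial[0] < whole[0])                     # True
```

In this example the early gap wins even though it holds only part of the group, because the low completion ratio does not outweigh its much earlier finish time; with other weights or gap positions the whole-group gap could win instead.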

(43) Chapter 5. Simulation Environment

This chapter presents the software simulator we developed for conducting the performance evaluation of the proposed task ranking and allocation methods for multiple workflow scheduling. The simulator was developed based on the discrete-event simulation methodology [8]. The entire simulation process is controlled by the function MainSimulator(). It controls what types of DAGs to generate and what kinds of scheduling methods to use. It also determines the number of computing resources and their properties. Section 5.1 describes the main system components in the simulator. Section 5.2 presents the classes used to build the simulator.

5.1 Main System Components

The following describes three data components used to represent the workload, queuing mechanisms, and computing environment in the simulator.

• Input Workload: A workflow can be represented by a Directed Acyclic Graph (DAG) describing the inter-task precedence constraints. A DAG is defined as G = (V, E), where V is a set of nodes, each representing a task, and E is a set of edges, each representing the computation precedence order between two tasks. In the simulator, a DAG is represented by a linked-list data structure where each node may have multiple outgoing pointers linked to its succeeding nodes.

• Queuing System: There are two system queues: a waiting queue and a ready queue. All tasks of a workflow are put into the waiting queue upon its submission. As time goes on, a task or task group will

(44) be moved into the ready queue, waiting for scheduling and resource allocation, once its preceding tasks finish execution.

• Computing Environment: In this thesis, the parallel computing platform is assumed to be composed of several physical machines located at the same place. The computing resource used by a task can be a physical machine itself or a virtual machine running on the physical machine via virtualization technology. Hence, the computing resources in the parallel computing platform may be heterogeneous in various aspects, such as computing speed, memory size, and hard disk size. Here, we focus on the heterogeneity of computing speed. The communication speed between any two machines is assumed to be the same. The costs incurred by a task include computation and communication costs, which represent task execution time and data transfer time, respectively. The computation cost of a task is affected by the computing speed of the resource executing it, which means that the computation cost of the same task may be different on different machines. The communication cost within the same physical/virtual machine is set to zero.

5.2 Classes in the Simulator

This section describes the major classes used to build the simulator, including Dag, Queuing, and Simulation.

• Dag: Class Dag is responsible for generating the input workload, a series of workflows having various properties and arriving at different times. It can generate two types of workflows: the

(45) general DAGs and the fork-join DAGs, as shown in Figures 1.1 and 3.1, respectively. Table 5.1 shows the Dag class in the UML style. It contains 6 attributes and 6 functions.

Table 5.1: Class definition of Dag

Dag
Attributes:
  +Dagnum: int
  +Dag_starttime: int
  +Node: unsigned int
  +Edge: unsigned int
  +CCR: double
  +Group_num: unsigned int
Functions:
  Fork-join(int dagnum)
  General(int dagnum)
  node_generator()
  edge_generator(double ccr)
  Compute_Rank()
  PCH()

The attributes and functions in Dag are described as follows:

Attributes:
1. Dagnum: the number of DAGs to be generated.
2. Dag_starttime: the DAG submission time, a random number.
3. Node: the number of tasks in a DAG, a random number.
4. Edge: the number of edges in a DAG, a random number.
5. CCR: communication cost to computation cost ratio.
6. Group_num: the number of task groups within each DAG.

(46) Functions:
1. Fork-join(int dagnum): randomly generates a fork-join DAG. It first randomly determines the numbers of nodes and edges and then invokes node_generator() and edge_generator().
2. General(int dagnum): randomly generates a general DAG. It first randomly determines the numbers of nodes and edges and then invokes node_generator() and edge_generator().
3. node_generator(): randomly generates the attributes of each node, such as computation cost.
4. edge_generator(double ccr): generates the inter-task dependency structure and the communication costs, according to the CCR value.
5. Compute_Rank(): computes the rank values of each node in the workflow, including the bottom rank, the top + bottom rank, and the bottom amount rank.
6. PCH(): clusters the nodes within a DAG into a set of task groups according to the inter-task dependency structure.
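The linked-node DAG representation and the waiting and ready queues described in Section 5.1 can be illustrated with a small Python sketch. It is not the simulator's code: the class and function names are hypothetical, and tasks are assumed to finish in the order in which they leave the ready queue.

```python
from collections import deque


class Node:
    """One task of a DAG, with outgoing pointers to its succeeding nodes."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.successors = []
        self.pending_parents = 0  # preceding tasks that have not finished yet


def link(parent, child):
    """Add the precedence edge parent -> child."""
    parent.successors.append(child)
    child.pending_parents += 1


def ready_order(nodes):
    """Order in which tasks move from the waiting queue to the ready queue."""
    waiting = {n.task_id: n for n in nodes}  # all tasks wait upon submission
    ready = deque()
    for n in nodes:
        if n.pending_parents == 0:
            ready.append(waiting.pop(n.task_id))
    order = []
    while ready:
        node = ready.popleft()  # the task is scheduled and assumed to finish
        order.append(node.task_id)
        for child in node.successors:
            child.pending_parents -= 1
            if child.pending_parents == 0:
                ready.append(waiting.pop(child.task_id))
    return order


# A minimal fork-join workflow: 1 -> {2, 3} -> 4.
n1, n2, n3, n4 = (Node(i) for i in range(1, 5))
link(n1, n2)
link(n1, n3)
link(n2, n4)
link(n3, n4)
order = ready_order([n1, n2, n3, n4])
print(order)  # [1, 2, 3, 4]
```

Task 4 becomes ready only after both of its preceding tasks have finished, which is the condition under which the simulator moves a task or task group into the ready queue.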

(47) • Queuing: This class implements the dynamic scheduling mechanism and the ready task queuing behavior in the simulator. Table 5.2 shows the members of the Queuing class.

Table 5.2: Class definition of Queuing

Queuing
Attributes:
  +dagchecklist: LinkedList
  +groupchecked: unsigned int
  +queuelist: LinkedList
  +time: int
  +priority: unsigned int
Functions:
  add_into_queue(int dag, int group, int priority)

The attributes and functions in Queuing are described as follows:

Attributes:
1. dagchecklist: a list of DAGs to be checked for new ready task groups or tasks.
2. groupchecked: the number of task groups checked.
3. queuelist: the list containing all ready task groups or tasks.
4. time: the global system time.
5. priority: the priority of each task group, based on the computation cost or the rank value of each task in the group.

Functions:

(48) 1. add_into_queue(int dag, int group, int priority): adds ready tasks or task groups within a DAG into the ready queue based on their priority values.

• Simulation: Class Simulation implements all the scheduling and allocation methods in this thesis. Table 5.3 shows the members of the Simulation class.

Table 5.3: Class definition of Simulation

Simulation
Attributes:
  +resourcesnum: int
  +efficiency: double [ ]
  +sim_starttime: unsigned int [ ]
  +sim_comptime: unsigned int [ ]
  +sim_overtime: unsigned int [ ]
  +simed: Boolean [ ]
Functions:
  HEFT(int dag, int task)
  FST(int dag, int task)
  gap_search(int dag, int group)
  distributed_gap_search(int dag, int group)

The attributes and functions in Simulation are described as follows:

Attributes:
1. resourcesnum: the number of computing resources.
2. efficiency: an array of the computing speed of each resource.
3. sim_starttime: an array of the start time for each task group or task allocated on a specific resource.
4. sim_comptime: an array of the computation time for each task group or task

(49) allocated on a specific resource.
5. sim_overtime: an array of the end time for each task group or task allocated on a specific resource.
6. simed: an array of flags used to check whether each task group or task has been simulated.

Functions:
1. HEFT(int dag, int task): implementation of the HEFT heuristic [16].
2. FST(int dag, int task): implementation of our FST allocation method.
3. gap_search(int dag, int group): implementation of our adjustable best-fit task group allocation heuristic.
4. distributed_gap_search(int dag, int group): implementation of our adaptive distributed gap search heuristic.

(50) Chapter 6. Experiments and Performance Evaluation

This chapter presents a series of experiments which evaluate the proposed task ranking and allocation methods in terms of average makespan through simulation studies. The proposed methods are compared to typical list-based and clustering-based workflow scheduling approaches, such as HEFT [16], the lookahead variant of HEFT [5], the pure best-fit task allocation approach [13], PCH [21][22], and the distributed gap search approach [18].

6.1 Experimental Setting

This section describes the experimental settings in our simulation studies, including DAG generation and performance metrics. The simulation experiments were conducted on a PC equipped with a 2.6 GHz AMD Athlon(tm) dual core processor and 1.87 GB RAM. We implemented a DAG generator to randomly generate workflows of fork-join DAG structure or general DAG structure for the following simulation experiments. The fork-join DAGs are generated as follows:

1. The generator generates a DAG with one entry node and one exit node.
2. Each DAG contains one to four fork-join structures randomly.
3. Each fork operation produces two to ten branches randomly.
4. Each branch contains two to six nodes randomly.
5. It can generate DAGs with different CCR values: 0.1, 1, and 10.
6. It assigns a random weight to each node and edge according to the specified CCR value.
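The fork-join generation rules above can be sketched in Python as follows. The sketch covers rules 1 to 4 and the node weights only. It assumes that consecutive fork-join structures are chained so that the join node of one structure serves as the fork node of the next, and that node costs are drawn uniformly from 1 to 30. Edge weights are omitted because the exact CCR-based rule is not given here. The function name is hypothetical.

```python
import random


def generate_fork_join(rng, max_cost=30):
    """Generate a random fork-join DAG as (costs, edges).

    costs maps node id -> computation cost; edges is a list of
    (parent, child) pairs. ASSUMPTION: the join node of one fork-join
    structure is also the fork node of the next one.
    """
    costs = {}
    edges = []

    def new_node():
        node = len(costs) + 1
        costs[node] = rng.randint(1, max_cost)
        return node

    fork = new_node()                          # the single entry node
    for _ in range(rng.randint(1, 4)):         # one to four fork-join structures
        tails = []
        for _ in range(rng.randint(2, 10)):    # two to ten branches per fork
            prev = fork
            for _ in range(rng.randint(2, 6)):  # two to six nodes per branch
                node = new_node()
                edges.append((prev, node))
                prev = node
            tails.append(prev)
        join = new_node()
        edges.extend((tail, join) for tail in tails)
        fork = join                            # the last join is the exit node
    return costs, edges


costs, edges = generate_fork_join(random.Random(0))
entry_nodes = set(costs) - {child for _, child in edges}
exit_nodes = set(costs) - {parent for parent, _ in edges}
print(len(costs), entry_nodes, exit_nodes)
```

Because every edge points from a lower to a higher node id, the generated graph is acyclic by construction, and the entry and exit sets each contain exactly one node.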

(51) General DAGs are generated as follows:

1. The generator randomly generates a DAG with four to five levels of tasks.
2. Each task level contains two to five nodes randomly.
3. Each task is randomly connected to some of the tasks at the next level.

The following settings apply to all experiments:

1. To simulate a speed-heterogeneous system, there are three different computing speeds, with the ratio 1:2:3, for the resources in the system.
2. Each node has a computation cost ranging from 1 to 30 seconds.
3. Each edge is assigned a communication cost based on the computation costs and the CCR of the entire workflow.
4. Each experiment was repeated 30 times and the average performance value was calculated.

We use the average makespan of all workflows as the performance metric in the following experiments, where the makespan is defined as the time between submission and completion of a workflow, including execution time and waiting time.

6.2 Task Ranking and Allocation in List-Based Workflow Scheduling

This section evaluates the proposed task ranking and allocation methods for list-based workflow scheduling. The proposed methods are compared with HEFT [16] and the lookahead variant of HEFT [5]. Figures 6.1 and 6.2 show the experimental results of different task ranking methods for general DAGs and fork-join DAGs, respectively, on three resources.

(52) To simulate speed heterogeneity, different computation costs are generated randomly for each task on different resources. Figure 6.1 indicates that the proposed top + bottom ranking method outperforms the bottom ranking approach used in HEFT [16]. Figure 6.2 reflects two points. Firstly, it demonstrates that the top + bottom task ranking method does not work well with fork-join DAGs, as illustrated in Chapter 3. Secondly, our bottom amount task ranking method is superior to the bottom ranking approach used in HEFT [16] for fork-join DAGs.

Figure 6.1: Different task ranking methods for general DAGs (HEFT).

Figure 6.2: Different task ranking methods for fork-join DAGs (HEFT).

Figures 6.3 and 6.4 compare the performance of different task allocation methods on three resources. The experimental results show that our FST achieves better performance than the EFT principle used in HEFT [16] for both general and fork-join DAGs. Moreover, the performance improvement by FST is more significant when applied to fork-join DAGs. For fork-join DAGs, 74% of the randomly generated DAGs achieve better performance using FST instead of EFT, while only 51% of general DAGs benefit from FST.

(53) Figure 6.3: FST vs. EFT for general DAGs.

Figure 6.4: FST vs. EFT for fork-join DAGs.

However, Figure 6.5 shows an example which indicates that FST might not be as effective as in the above experiments for a more lightly loaded system. Figure 6.5 contains four schedules, resulting from scheduling the example fork-join workflow onto three or six resources using the EFT or FST allocation method. It is clear that in the case of three resources, FST leads to a better schedule than EFT. This is because, for the tasks on the critical path indicated by the red line, {1, 2, 9, 13, 14, 15, 16, 17}, the schedules produced by EFT and FST both incur two communication costs, while the schedule of FST allows those tasks to run on the fastest resources, leading to a shorter makespan. On the other hand, the situation changes in the case of six resources. Since FST does not consider communication costs when making allocation decisions, it has a higher probability of allocating tasks onto different resources, incurring communication costs. In the cases of Figures 6.5 (d) and (e), the tasks on the critical path, {1, 2, 9, 13, 14, 15, 16, 17}, incur three communication costs in the schedule of EFT, but five communication costs in the schedule of FST. The higher communication costs offset FST's benefit of allocating tasks onto the fastest resources, resulting in a worse makespan. In a shared parallel computing environment, such as grid and cloud, each user or application can only acquire an uncertain portion of resources, depending on the system load and resource status at

(54) that time. The above observation points out that the best task allocation choice might depend on the system load and the number of resources acquired. Therefore, task allocation for workflow scheduling becomes even more challenging in such shared parallel computing environments and requires further research efforts.

Figure 6.5: (a) Example workflow (b) schedule by EFT on 3 resources (c) schedule by FST on 3 resources (d) schedule by EFT on 6 resources (e) schedule by FST on 6 resources.

Figure 6.6 evaluates the effect of the number of branches in fork-join DAG’s on the performance of FST. The experimental results show that the performance improvement achieved by FST increases as the number of branches grows. Figure 6.7 evaluates the effect of branch length in fork-join DAG’s on the performance of FST. The experimental results indicate that the performance improvement is more significant with longer branches.

Figure 6.8 illustrates the effect of the number of branches in fork-join DAG’s. The tables in the figure show the computation costs of tasks on different resources. Figure 6.8 (a) shows the schedules of a fork-join workflow with two branches produced by the EFT and FST task allocation methods, respectively, while Figure 6.8 (b) is a comparative example with a fork-join workflow of three branches. In Figure 6.8 (a), EFT clearly tends to produce a schedule with a lower degree of concurrency, with all tasks allocated on R2. In contrast, FST allocates tasks 2 and 3, which are on different branches, onto different resources, resulting in a higher degree of concurrency and thus a shorter makespan. In Figure 6.8 (b), as the number of branches increases, FST produces a schedule with an even higher degree of concurrency than that in Figure 6.8 (a). Since a higher degree of concurrency has the potential to achieve a shorter makespan, this explains why FST in general leads to a larger performance improvement as the number of branches increases.

Figure 6.9 illustrates the effect of branch length in fork-join DAG’s. For simplicity, the example uses a single branch instead of an entire fork-join workflow. In the example workflow in Figure 6.9, task 1 runs fastest on resource R1, while the other tasks run fastest on resource R2. Both task allocation methods, EFT and FST, allocate task 1 on R1. Since the communication costs are larger than the difference in computation costs among resources in this case, EFT tends to allocate all other tasks on the same resource, R1, to minimize the combined effect of communication and computation costs. However, this leads to a worse situation in which all tasks except task 1 run on a slower resource, resulting in a longer makespan. On the other hand, FST simply allocates each task on the fastest resource for it, allowing tasks 2, 3, and 4 to run on the fastest resource and leading to a shorter makespan. Since the number of affected tasks is proportional to the branch length, this explains why FST in general achieves a larger performance improvement as branch length increases.

Figure 6.6: Effects of different numbers of branches in fork-join DAG’s.

Figure 6.7: Effects of branch length in fork-join DAG’s.

Figure 6.8: Effects of different numbers of branches in fork-join DAG’s.
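The single-branch argument above can be reproduced with a small simulation. The following Python sketch schedules a chain of dependent tasks under the two allocation rules as they are described in the text: EFT picks the resource with the earliest finish time including communication, while FST picks the resource with the lowest computation cost for the task. The function name, the cost values, and the single uniform communication cost are assumptions made for illustration; they are not the numbers in Figure 6.9.

```python
def schedule_chain(comp, comm, policy):
    """Schedule a chain of dependent tasks and return (allocation, makespan).

    comp[i][r] is the computation cost of task i on resource r.
    comm is paid whenever two consecutive tasks run on different resources.
    policy is "EFT" or "FST" (see the lead-in for the assumed meaning).
    """
    allocation = []
    finish = 0.0   # finish time of the previously scheduled task
    prev = None    # resource used by the previously scheduled task
    for costs in comp:
        def finish_on(resource):
            delay = comm if prev is not None and resource != prev else 0.0
            return finish + delay + costs[resource]

        if policy == "EFT":
            chosen = min(costs, key=finish_on)   # communication is considered
        elif policy == "FST":
            chosen = min(costs, key=costs.get)   # communication is ignored
        else:
            raise ValueError(f"unknown policy: {policy}")

        finish = finish_on(chosen)
        prev = chosen
        allocation.append(chosen)
    return allocation, finish


# Hypothetical costs: task 1 is fastest on R1, tasks 2-4 are fastest on R2,
# and the communication cost (3) exceeds the computation difference (2).
comp = [{"R1": 2, "R2": 4}] + [{"R1": 5, "R2": 3} for _ in range(3)]

print(schedule_chain(comp, comm=3, policy="EFT"))
print(schedule_chain(comp, comm=3, policy="FST"))
```

With these assumed costs, EFT keeps every task on R1 and FST moves tasks 2 to 4 onto R2, paying one communication cost. Each additional task appended to the branch costs EFT 5 time units and FST 3, so the gap widens with branch length, consistent with the explanation above.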

Figure 6.9: Effects of branch length in fork-join DAG’s.

Figures 6.10, 6.11, 6.12, and 6.13 evaluate the total performance improvement achieved by integrating the proposed task ranking and allocation methods. The integrated approach is compared with HEFT [16] and the lookahead variant of HEFT [5]. Figures 6.10 and 6.11 present the performance evaluation for general DAG’s and fork-join DAG’s, respectively. The experimental results show that our approach outperforms both existing methods, and the performance improvement is more significant for fork-join DAG’s. Figures 6.12 and 6.13 evaluate our approach with two well-known real-world workflow applications, Montage [7] and LIGO [2], respectively. The workflow structures of these two applications are shown in Figures 6.14 and 6.15, respectively. The experimental results indicate that our integrated approach achieves better performance than existing approaches for real-world workflow applications. In summary, our integrated approach achieves up to 11.8% performance improvement.

Figure 6.10: Evaluation of the integrated approach for general DAG’s.

Figure 6.11: Evaluation of the integrated approach for fork-join DAG’s.

Figure 6.12: Evaluation of the integrated approach with Montage.

Figure 6.13: Evaluation of the integrated approach with LIGO.

Figure 6.14: Montage.

Figure 6.15: LIGO.

6.3 Task Group Allocation in Clustering-Based Multiple Workflows Scheduling

This section evaluates the proposed task group allocation methods for clustering-based multiple workflow scheduling, including enhanced best-fit task group allocation, adjustable task group allocation, and adaptive distributed task group allocation. The proposed methods are compared with the pure best-fit approach [13], PCH [21][22], and the distributed gap search approach [18].

Figures 6.16, 6.17, and 6.18 compare the continuous task group allocation methods under different CCR values. The experiments evaluate the proposed approaches on a 30-resource homogeneous system running 100 online workflows. The results indicate that, in general, our approaches, enhanced best-fit and adjustable best-fit, outperform previous methods, and that the best σ value varies with the CCR. σ = 0.8 or 0.9 leads to the best performance improvement when CCR is 1, and σ = 0.3 or 0.4 achieves the most obvious performance improvement when CCR is 10. When CCR is 0.1, different σ values make a negligible performance difference. The results also indicate that larger CCR values lead to more significant performance improvement, since larger CCR values imply longer idle time slots that can accommodate task groups under continuous task group allocation, thus allowing different allocation decisions. On the other hand, when CCR is small, there are few idle time slots that can accommodate entire task groups under continuous task group allocation, so different task group allocation methods make little difference.

The performance of the adjustable best-fit approach under different CCR values provides another insight: time slot fitness is more important than EFT when CCR is medium, since higher σ values lead to better performance, as shown in Figure 6.17. This is because idle time slots large enough for task groups are limited under such CCR values, and thus efficient utilization of those time slots is crucial. On the other hand, when CCR is high, there are plenty of large idle time slots. Therefore, EFT becomes more important, as shown in Figure 6.18, where smaller σ values achieve better performance.

Figure 6.16: 100 workflows on 30-resource homogeneous system with continuous task group allocation (CCR=0.1).

Figure 6.17: 100 workflows on 30-resource homogeneous system with continuous task group allocation (CCR=1).

Figure 6.18: 100 workflows on 30-resource homogeneous system with continuous task group allocation (CCR=10).

Figures 6.19, 6.20, and 6.21 present the evaluation of the proposed adaptive distributed task group allocation method, compared to the original distributed task group allocation [18]

under different CCR values. The results indicate that our adaptive distributed task group allocation significantly outperforms the original distributed task group allocation method [18] under various CCR conditions. Similar to the experiments for continuous task group allocation, time slot fitness is crucial when CCR is medium, as shown in Figure 6.20, and EFT becomes more important as CCR increases, as shown in Figure 6.21.

Figure 6.19: 100 workflows on 30-resource homogeneous system for different distributed task group allocation methods (CCR=0.1).

Figure 6.20: 100 workflows on 30-resource homogeneous system for different distributed task group allocation methods (CCR=1).

Figure 6.21: 100 workflows on 30-resource homogeneous system for different distributed task group allocation methods (CCR=10).

The following presents the evaluation of the proposed task group allocation methods in a speed-heterogeneous parallel system. Figures 6.22, 6.23, and 6.24 evaluate the proposed enhanced best-fit and adjustable best-fit approaches. Figures 6.25, 6.26, and 6.27 compare the proposed adaptive distributed task group allocation with the original distributed task group allocation method [17]. Similar to the results for homogeneous systems, our approaches in general outperform existing task group allocation methods in terms of average makespan.

One thing to note is that the pure best-fit approach proposed in [13] performs poorly for larger CCR. When CCR is 0.1, PCH [21][22] and the pure best-fit approach achieve almost the same performance, since in this case there are very few idle time slots large enough for allocating task groups. As CCR increases to 1, the pure best-fit approach slightly outperforms PCH, demonstrating the benefit of best-fit allocation in raising resource utilization. However, when CCR becomes even larger, 10 in this case, the pure best-fit approach performs poorly compared to PCH. This is because for large CCR there are plenty of idle time slots large enough for allocating task groups. In such cases, best-fit and first-fit allocation lead to similar resource utilization, but the first-fit principle in PCH achieves better performance, since the pure best-fit approach might skip some earlier available time slots to find the fittest one, delaying tasks’ start times and in turn degrading the performance of the entire workflow. On the other hand, our enhanced best-fit and adjustable best-fit approaches deliver consistently better performance across all CCR values.

Figure 6.22: 100 workflows on 30-resource heterogeneous system for continuous task group allocation (CCR=0.1).
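The trade-off between slot fitness and finish time discussed in this section can be illustrated with a small Python sketch. It selects an idle time slot for a task group by blending two normalized criteria with a weight σ: the leftover space in the slot (fitness) and the finish time of the group in that slot. This scoring rule, the function name, and the slot values are assumptions made for illustration only; the sketch is not the formula used by the proposed adjustable best-fit method. It only mirrors the observation that larger σ favors fitness and smaller σ favors EFT.

```python
def choose_slot(slots, duration, sigma):
    """Pick an idle slot (start, end) for a task group of the given duration.

    sigma = 1 behaves like pure best-fit (smallest leftover space);
    sigma = 0 picks the slot giving the earliest finish time.
    The blended score is an illustrative assumption, not the thesis's rule.
    Returns None when no slot is long enough.
    """
    fits = [(s, e) for s, e in slots if e - s >= duration]
    if not fits:
        return None

    def normalise(values):
        lo, hi = min(values), max(values)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

    waste = normalise([e - s - duration for s, e in fits])   # slot fitness
    finish = normalise([s + duration for s, _ in fits])      # finish time
    scores = [sigma * w + (1 - sigma) * f for w, f in zip(waste, finish)]
    return fits[scores.index(min(scores))]


# Hypothetical idle slots on one resource and a task group of length 10.
idle_slots = [(0, 50), (10, 22), (40, 51)]

for sigma in (0.0, 0.5, 1.0):
    print(sigma, choose_slot(idle_slots, duration=10, sigma=sigma))
```

In this example the pure best-fit setting (σ = 1) selects the tightest slot even though it starts much later, which is the start-time delay described above for large CCR, while an intermediate σ selects a slot that is both reasonably tight and reasonably early.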