
Chung Hua University Doctoral Dissertation

Communication Optimization and Job Scheduling Techniques for Heterogeneous Parallel Systems

Optimizing Communications and Job Scheduling in Heterogeneous Parallel Systems

Program: Ph.D. Program in Engineering Science
Student: D09524004, Tai-Lung Chen

Advisor: Dr. Ching-Hsien Hsu

September 2010


Optimizing Communications and Job

Scheduling in Heterogeneous Parallel Systems

By

Tai-Lung Chen

Advisor: Prof. Ching-Hsien Hsu

Ph. D. Program in Engineering Science Chung-Hua University

707, Sec.2, WuFu Rd., Hsinchu, Taiwan 300, R.O.C.

September 2010


Chinese Abstract

In heterogeneous computing environments, efficient job scheduling and broadcasting techniques are essential for improving overall system performance. To address high-performance computing problems in grid environments, the main research directions are message broadcasting, job scheduling, resource management and quality of service. In dynamic heterogeneous networks, job scheduling and message delivery methods must be designed according to the network architecture and the data structures involved.

In this dissertation, we propose contention-free broadcast techniques, task scheduling, job grouping with dynamic dispatching, and QoS guided job scheduling algorithms. For the different heterogeneous computing systems, including irregular networks, master-slave architectures and grid computing, we design corresponding job scheduling and message delivery schedules. To achieve the best performance, this work considers the load on heterogeneous network bandwidth as well as system load balance, avoiding redundant computation. The proposed algorithms improve on previous job scheduling methods: because transmission bandwidth and computing performance are highly heterogeneous, the improved communication and job scheduling significantly raise system performance. Another advantage of the proposed algorithms is that the main scheduling step allocates work according to the ratio between job execution time and job transmission time, reducing processor idle time, while dynamic scheduling further raises processor utilization and reduces wasted resources. To evaluate the proposed algorithms, we implemented them and compared them with well-known methods from the literature. The experimental results show that, across different network architectures, the proposed algorithms have the following advantages: (1) higher overall system throughput; (2) shorter message broadcasting and delivery time; (3) less processor idle time; and (4) higher processor utilization.

Keywords: Contention-free broadcasting, master-slave architecture, heterogeneous computing, job scheduling, QoS guided


ABSTRACT

Job scheduling and broadcasting strategies are important issues for improving system performance in heterogeneous systems. In investigating the problems of grid technologies in high performance computing, message broadcasting, job scheduling, resource management and quality of service have always been the main challenges. Across the variety of heterogeneous environments, the design of job allocation and message forwarding strategies depends on the network architecture and the structure of the resources. In this study, contention-free broadcasting, task scheduling, job grouping with dynamic dispatching and QoS guided job scheduling are proposed. We focus on heterogeneous networks in the different settings of irregular networks, the master-slave model and grid computing. The main extensions of this study are the consideration of heterogeneous communication overheads and system load balance. One significant improvement of our approach is that system throughput can be maximized by decreasing the computation and communication overhead. The other advantage of the proposed method is that processor utilization can be increased with dynamic scheduling. To evaluate the performance of the proposed techniques, we have implemented the proposed algorithms and compared them with previous methods. The experimental results show that our techniques outperform other algorithms in terms of higher system throughput, minimum broadcasting time, less processor idle time, and higher processor utilization.

Keywords: Contention-free broadcasting, Master-slave model, Heterogeneous processors, Task scheduling, Quality of service.


Acknowledgements

I would like to thank my research advisor, Prof. Ching-Hsien Hsu, for being a consistent source of support and encouragement.

Prof. Ching-Hsien Hsu is a conscientious and careful scholar. He gave me many suggestions, not only on the dissertation but also on how to approach graduate life. I am fortunate to be one of Prof. Hsu's Ph.D. students.

I would also like to thank the members of the P.D. Lab, who always supported me in this dissertation work.

Special thanks go to my dissertation committee members. Each devoted significant time and effort to my dissertation, and their suggestions and comments led to substantial improvement in the final product.

Finally, I would like to thank my family for their great support; without their encouragement, I could not have completed this dissertation with an untroubled mind.


Table of Contents

Chinese Abstract ...i

English Abstract...ii

Acknowledgements... iii

Table of Contents...iv

List of Tables ...vi

List of Figures ...vii

1 Introduction...1

1.1 Motivations ...3

1.2 Related Works ...4

1.3 Achievements and Contribution of the Dissertation...5

1.4 Organization of the Dissertation ...6

2 Broadcasting Techniques on Irregular Networks ...7

2.1 Introduction...7

2.2 Related Works ...9

2.3 Research Architecture ...10

2.4 Location Aware Broadcast Scheme (LABS)...13

2.4.1 Location Oriented Spanning Tree (LST) ...14

2.4.2 Contention-Free Broadcast ...21

2.5 Performance Evaluation...27

2.5.1 Effects of Message Size and Speed Types ...28

2.5.2 Miscellaneous Comparisons ...30

2.6 Summary ...32

3 Communication Sensitive Techniques for Grid Scheduling...34

3.1 Introduction...34

3.2 Related Works ...35

3.3 Definitions and Research Architecture ...36

3.4 Fast Processor First Algorithm (FPF) ...39

3.5 Smallest Communication Ratio Algorithm (SCR)...42

3.6 Enhanced Smallest Communication Ratio Algorithm (ESCR) ...43

3.7 Performance Evaluation...53


4 Job Grouping Techniques for Grid Scheduling ...57

4.1 Introduction...57

4.2 Related Works ...58

4.3 Definitions and Research Architecture ...59

4.4 Computing Bound Job Scheduling ...62

4.5 Job Grouping Strategy with Dynamic Scheduling...65

4.6 Performance Evaluation...73

4.6.1 Simulation Results for HPIC ...73

4.6.2 Simulation Results for HPHC...75

4.7 Summary ...79

5 QoS Guided Job Scheduling ...81

5.1 Introduction...81

5.2 Related Works ...83

5.3 Research Model ...84

5.4 Rescheduling Optimization...87

5.4.1 Makespan Optimization Rescheduling (MOR) ...88

5.4.2 Resource Optimization Rescheduling (ROR)...90

5.5 Performance Evaluation...92

5.5.1 Parameters and Metrics...92

5.5.2 Experimental Results of MOR...93

5.5.3 Experimental Results of ROR...96

5.6 Summary ...98

6 Conclusions and Future Work ...99

Current Publication Lists...101

References...104


List of Tables

Table 2.1: Parameters of sender and receiver. ...28

Table 4.1: An example of processed task recording table ...70

Table 5.1: Parameters and comparison metrics ...93

Table 5.2: Comparison of makespan in MOR ...94

Table 5.3: Comparison of resource used in ROR ...96


List of Figures

Figure 2.1: Example of switch based HNOW. ...8

Figure 2.2: Graph model for switch based HNOW. ...12

Figure 2.3: Communication model of peer-to-peer message passing...13

Figure 2.4: Workflow indication of the LABS method. ...13

Figure 2.5: Paradigm of an LST. ...16

Figure 2.6: The switch-based graph example ...20

Figure 2.7: Three contention-free links examples. ...21

Figure 2.8: Representational graph with node id. ...24

Figure 2.9: LST rooted by S4...25

Figure 2.10: Adjusted LST rooted by S4...25

Figure 2.11: Completed scheduling tree rooted by workstation 9 ...26

Figure 2.12: Completed scheduling tree after exchanging nodes 10 and 11 ...26

Figure 2.13: Different number of workstations with small message (2048 flits) ...29

Figure 2.14: Different number of workstations with large message (10240 flits) ...29

Figure 2.15: Average communication latency with speed variations ...30

Figure 2.16: Analysis of hybrid environment with 25% type 8 and 25% type 1...31

Figure 2.17: Analysis of hybrid environment with 50% type 8 and 25% type 1...31

Figure 2.18: Analysis of hybrid environments with distributed bandwidth. ...32

Figure 3.1: The master slave architecture. ...37

Figure 3.2: Example of master-slave task scheduling ...40

Figure 3.3: Example of SCR task allocation ...43

Figure 3.4: Master slave tasking in heterogeneous network with deadline. ...45

Figure 3.5: The ESCR algorithm...48

Figure 3.6: Different task allocation schemes, SCR, ESCR and FPF...50

Figure 3.7: The binary approximation method of ESCR algorithm...50

Figure 3.8: Task allocation paradigm showing different startup waiting time. ...52

Figure 3.9: Performance results for different number of processors. ...53

Figure 3.10: Performance results of the algorithms under different system deadline ...54

Figure 3.11: Performance results of the algorithms under different number of tasks ...55

Figure 4.1: The multi-site server paradigm...60


Figure 4.2: The FCFS job scheduling on heterogeneous network...63

Figure 4.3: Min-Min job scheduling on heterogeneous network ...64

Figure 4.4: Max-Min job scheduling on heterogeneous network. ...64

Figure 4.5: Job grouping method by set-theoretic intersection...66

Figure 4.6: Job grouping example by JGDS ...67

Figure 4.7: JGDS job scheduling on heterogeneous network...69

Figure 4.8: JGDS multiple grouping workflow paradigm. ...69

Figure 4.9: JGDS-D job scheduling on heterogeneous network...72

Figure 4.10: Simulation results for different number of jobs in HPIC...74

Figure 4.11: Simulation results for different number of client nodes in HPIC ...75

Figure 4.12: Simulation results for different number of jobs in HPHC ...76

Figure 4.13: Simulation results for different number of client nodes in HPHC...77

Figure 4.14: Simulation results for setting value of similar factor ...78

Figure 5.1: The Min-Min algorithm. ...85

Figure 5.2: The QoS guided algorithm. ...86

Figure 5.3: Min-Min and QoS Guided Min-Min ...87

Figure 5.4: The MOR Algorithm. ...88

Figure 5.5: Example of MOR...90

Figure 5.6: Example of ROR...91

Figure 5.7: The ROR Algorithm. ...92


Chapter 1

Introduction

Broadcasting is a common operation in various network applications. The key requirements for efficient broadcasting are minimized communication latency and improved network utilization, which together determine the Quality of Service (QoS). Constructing a scheduling tree for broadcasting is a frequently used method, but finding an optimal scheduling tree is an NP-complete problem. Networks of Workstations (NOW) offer the superior properties of high bandwidth, scalability, flexibility and cost-efficiency. Most research projects were executed on homogeneous NOW [10, 39, 62 and 89], an architecture comprising similar workstations in a single network. Owing to advances in equipment such as workstations and network bandwidth, NOW environments have become more heterogeneous. The workstations in Heterogeneous Networks of Workstations (HNOW) differ in many parameters, including CPU speed [83], memory size and communication speed, and these parameters crucially affect the performance of broadcasting.

Grid technology has been recognized as an efficient solution for coordinating large-scale shared resources and executing complex applications in heterogeneous network environments. In recent years, the problem of scheduling tasks in heterogeneous networks has prompted researchers to propose different approaches. Task scheduling research on heterogeneous processors can be classified into the DAG model, the master-slave paradigm [16, 17] and computational grids. Grid computing is a collection of connected resources, each of which contributes some combination of resources to the grid as a whole to perform a task. To users it appears as one large system, providing a single point of access to powerful distributed resources. The most common resource is CPU cycles provided by the processors of the machines on the grid, and these processors differ in speed, architecture, software platform and other factors. Grid nodes are often geographically distributed, so supporting fast data access on grid networks is a major concern.

A centralized computational grid system can be viewed as the combination of one resource broker and a number of heterogeneous clusters. In the master-slave architecture [13], the resource broker can be viewed as the master node, while the heterogeneous cluster nodes can be viewed as slave processors. Therefore, to investigate the task scheduling problem, the master-slave paradigm is a good vehicle for developing tasking technologies in centralized grid systems.

The data grid, a grid computing system that deals with data, is an integrated architecture that specializes and extends the computational grid. The data grid was designed to satisfy the requirements of grouping large datasets, the geographically distant distribution of users and resources, and computationally intensive analyses that must evaluate and manage large amounts of data. The architecture was also developed to suit operation in wide-area, multi-institutional and heterogeneous environments. A data replication service seeks to improve network traffic by copying heavily accessed data to appropriate locations and managing its disk usage, so that the available bandwidth can be fully utilized for maximum speed. The economy [22, 71] is an old, historically and sociologically mature system that influences many areas of human life, and some research applies economy-based grid computing to large-scale problems in high-energy physics, molecular docking, computer micro-tomography and many other fields. Fairly distributing the available resources is one of the open problems in grid computing, and the economy model is one way to solve such resource distribution problems.


1.1 Motivations

Job scheduling and message broadcasting are among the most prominent issues in heterogeneous environments. The data server nodes hold the data that needs to be processed by the client at each processor. Job scheduling and broadcast technology are widely used in grid system design to send non-identical data to the clients that join the processing. Therefore, numerous scheduling algorithms have been implemented for different architectures, such as the DAG model, the master-slave model [17] and grid systems.

Because of the new paradigm of heterogeneous computing, scheduling methods designed for traditional clusters with homogeneous bandwidth, such as parallel machines or PC clusters, cannot be applied directly to heterogeneous systems. It therefore becomes critically important to develop efficient scheduling and service techniques that adapt to heterogeneous networks.

In general, there are three types of job scheduling operation:

1. Communication bound scheduling: the source data or messages are broadcast sequentially according to the bandwidth of the communication links.

2. Computing bound scheduling: jobs are sent to other processors according to the priority of their computing power.

3. Computation and communication hybrid scheduling: jobs are allocated to the client processors by a pre-scheduling scheme that considers both execution and transmission time.

The third type can achieve a smaller makespan and higher throughput than the other two.
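As a toy illustration of the third, hybrid category, the sketch below greedily assigns each job to the processor whose combined transmission and execution time finishes earliest. Every job name, processor name and time value here is invented for illustration (none comes from the dissertation), and contention on the server's outgoing link is ignored for simplicity:

```python
# Toy sketch of computation + communication hybrid scheduling.
# All names and times are invented; server-link contention is ignored.

def hybrid_schedule(jobs, exec_time, trans_time):
    """Greedily assign each job to the processor with the earliest
    transmission + execution completion time; return (plan, makespan)."""
    procs = sorted({p for (_, p) in exec_time})
    ready = {p: 0.0 for p in procs}  # time at which each processor frees up
    plan = {}
    for job in jobs:
        best = min(procs,
                   key=lambda p: ready[p] + trans_time[(job, p)] + exec_time[(job, p)])
        ready[best] += trans_time[(job, best)] + exec_time[(job, best)]
        plan[job] = best
    return plan, max(ready.values())

exec_time = {("j1", "p1"): 4, ("j1", "p2"): 2, ("j2", "p1"): 3, ("j2", "p2"): 3}
trans_time = {("j1", "p1"): 1, ("j1", "p2"): 1, ("j2", "p1"): 1, ("j2", "p2"): 2}
plan, makespan = hybrid_schedule(["j1", "j2"], exec_time, trans_time)
print(plan, makespan)  # j1 goes to p2, j2 to p1
```

Note how j2 avoids the nominally faster choice once transmission time and processor readiness are taken into account; a purely computing bound scheduler would not see this.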

In this dissertation, we present several algorithms that aim to perform computation and communication scheduling efficiently in heterogeneous networks. The main idea of the job scheduling algorithms is first to minimize the transmission cost in the network and then to focus on load balance. Different scheduling heuristics are implemented according to the network characteristics.


1.2 Related Works

The main purpose of job scheduling is to achieve high-performance computing and high system throughput. Previous research has focused on increasing execution efficiency and minimizing the makespan, increasing the processor utilization of the systems, or improving the quality of service. Beaumont et al. [13, 14] introduced the master-slave paradigm with task scheduling on heterogeneous processors. To minimize average turnaround time, Hsu et al. [49] presented a task dispatching method for heterogeneous systems. A pre-scheduling strategy was proposed for grid systems to maximize system throughput [48]. Hsu et al. [47] proposed a genetic algorithm and a recursive adjustment scheme for QoS economics in grid computing. Chen et al. [27] presented global grid load balancing in task scheduling based on the master-slave model. The computational grid [4] has become a widely accepted paradigm for large-scale parallel systems. Angelo et al. [5] focused on developing scheduling techniques to minimize the makespan of a broadcast operation in grid environments. Beaumont et al. [15, 16] concentrated on broadcasting in heterogeneous platforms under the one-port model. To avoid contention and schedule messages in a minimal number of communication steps, Hsu et al. [46] proposed contention-free irregular redistribution scheduling in parallelizing compilers. Thanalapati et al. [86] proposed an adaptive scheduling scheme for homogeneous processor platforms that uses space-sharing and time-sharing to schedule tasks. Han et al. [42] and Plank et al. [74] presented scheduling algorithms that enable software fault tolerance and task migration in real-time environments. More recently, Dogan et al. [34] and Hagras et al. [41] discussed task scheduling for heterogeneous computing based on the DAG paradigm. Srinivasan et al. [84] addressed the scheduling problem with reliability optimization for general heterogeneous computer systems. In [12], further investigations were based on incremental cost functions.

For QoS guided grid scheduling, applications in grids clearly need various resources to run to completion. An architecture named the public computing utility (PCU), proposed by Asaduzzaman and Maheswaran [6], uses virtual machines (VMs) to implement time-sharing over the resources and augments a finite number of private resources with public resources to obtain a higher level of quality of service. However, the QoS demands of a job may span various packet types and classes, so a scheduling algorithm that supports multiple QoS classes is needed. Based on this demand, Kim et al. [53] proposed a multi-QoS scheduling algorithm that improves scheduling fairness and meets users' demands. A hybrid approach for scheduling moldable jobs with QoS demands was presented by He et al. [43]. A novel framework for policy-based scheduling in grid resource allocation was also presented in [50]; its scheduling strategy can control the assignment of requests to grid resources by adjusting usage accounts or request priorities, achieves resource management by assigning usage quotas to intended users, and also supports reservation-based grid resource allocation and quality-of-service features. Sometimes the scheduler must not only match a job to a resource but also find an optimized transfer path based on network cost. The distributed QoS network scheduler (DQNS) was presented to adapt to ever-changing network conditions and aims to serve path requests based on a cost function [71].

1.3 Achievements and Contribution of the Dissertation

In this dissertation, we present efficient task scheduling and broadcasting strategies for distributing tasks onto computing nodes in the underlying heterogeneous networks, and performance- and economization-oriented scheduling techniques for managing applications with QoS demands in grid systems. Cost and speed are the reasons customers consider solving their problems with grid computing, so this dissertation focuses on how to control cost and time to satisfy clients' requirements when their requests are reasonable. We propose different algorithms to optimize the transfer time. The purpose of the proposed techniques is to minimize average turnaround time by dispatching tasks to processors with efficient task scheduling that considers the communication cost ratio. System throughput can also be enhanced by an efficient broadcasting scheme. The proposed techniques can be applied to heterogeneous systems as well as computational grid environments, in which communication costs vary across heterogeneous networks.

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows. Chapter 2 describes the construction of a location oriented spanning tree and the scheduling tree for the Location Aware Broadcast Scheme (LABS). Chapter 3 presents a motivating example to demonstrate the characteristics of the master-slave tasking model and task allocation, the Smallest Communication Ratio (SCR) task allocation algorithm, and its enhanced method, Extended SCR. Chapter 4 illustrates the Job Grouping and Dispatching Strategy algorithm (JGDS) and the enhanced JGDS with Dynamic scheduling (JGDS-D). Chapter 5 describes the Min-Min and QoS guided Min-Min algorithms together with two optimization schemes, corresponding to different rescheduling approaches for reducing the execution time of a batch of grid tasks and the total resource cost. Finally, Chapter 6 concludes this dissertation.


Chapter 2

Broadcasting Techniques on Irregular Networks

With the advance of network and computer techniques, scalable computing is becoming a new trend. For effectively integrating and utilizing distributed and heterogeneous resources, message broadcasting is an important and crucial technique in grid systems. In this chapter, a Location Aware Broadcast Scheme (LABS) for broadcasting in irregular heterogeneous networks is presented. LABS introduces a new scheduling scheme based on the heterogeneity of the workstations and the network topology.

Together with a binomial tree optimization technique, LABS can arrange communications to be contention free while using the shortest routing path. To evaluate the performance of LABS, the proposed techniques were implemented along with other algorithms. The experimental results show that LABS performs well in various circumstances and produces significant improvements in environments with high heterogeneity.

2.1 Introduction

Based on the above description, a Heterogeneous Network of Workstations (HNOW) can be regarded as a form of cluster computing. Cluster computing is a commonly found computing environment consisting of many workstations connected by a Local Area Network (LAN). These workstations can be independent computer systems located in the same place or in different places; briefly, cluster computing has these workstations work together as a single system. Cluster computing has many advantages, such as high performance, scalability, high throughput, system availability and cost-efficiency. Nowadays, the connection speed of a switch between workstations can be as high as a gigabit per second, which makes cluster computing much faster with less communication latency. Hence, broadcasting on switch-based HNOW was investigated.

In this chapter, the main objective is to construct a performance-effective scheduling tree for broadcasting on switch-based HNOW. Figure 2.1 depicts an example of switch-based HNOW. A rectangle represents a switch with eight ports connected to either workstations or other switches. The links of a switch-based HNOW are bidirectional between any two switches, meaning that the opposite directions of every link connecting two switches can be used simultaneously without link contention. The three symbols represent the workstations connected to the switches, each symbol standing for a different speed. Hence, the aim is to design a scheduling tree that broadcasts messages between workstations via the shortest paths and places the fastest workstations on its upper levels.


Figure 2.1: Example of switch based HNOW.

Wormhole routing technology offers low communication latency: instead of receiving a complete packet at a workstation and then sending it to the next workstation, wormhole routing advances the head of a packet directly from the incoming to the outgoing port of the routing switch. This is a special case of cut-through switching, a method for packet-switching systems in which the switch starts sending a packet as soon as the destination address is processed, before the whole packet has been received. This method reduces the latency of a packet passing through the switch. In wormhole routing, a packet is divided into a number of flits (flow control digits) for transmission. The size of a flit depends on system parameters, in particular the port width of the switch. The header flit (or flits) of a packet steers the switch: whenever a workstation examines the header flit(s) of a message, it selects the next port of the switch and starts forwarding flits down that port. As the header advances along the specified switches, the remaining flits follow in a pipelined fashion. Although wormhole routing achieves low communication latency, both deadlock and congestion can occur. Link contention is also an essential problem for broadcasting in the network: if the workstations are not arranged properly in the scheduling tree, two different workstations may use the same directed link to send messages at the same time, which delays the broadcast and decreases the performance of the whole network.
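The directed-link contention condition described above can be sketched in a few lines. The path representation (a list of switch names) and the switch names themselves are illustrative assumptions, not the dissertation's data structures:

```python
# Two transfers contend only if they use the same link in the same
# direction; opposite directions of a bidirectional link are independent.

def links(path):
    """Directed links used by a routing path, e.g. ["S1", "S2", "S4"]."""
    return {(path[i], path[i + 1]) for i in range(len(path) - 1)}

def contend(path_a, path_b):
    """True if the two paths share at least one directed link."""
    return bool(links(path_a) & links(path_b))

print(contend(["S1", "S2"], ["S2", "S1"]))        # False: opposite directions
print(contend(["S1", "S2", "S4"], ["S1", "S2"]))  # True: both use S1 -> S2
```

A contention-free schedule must avoid pairing any two simultaneous transfers for which this predicate is true.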

2.2 Related Works

Communication scheduling research on heterogeneous networks can target the virtual machine of a job broker in an Ethernet-based cluster paradigm or in grid computing. Previous research mostly addressed execution on Networks of Workstations (NOW); a case of NOW is presented in [2]. Amit Singhal et al. present different approaches to multicasting on switch-based HNOW with different speed types [83]. Nowadays, the connection speed of a switch between workstations can be as high as a gigabit per second [19]. Lin et al. present a scheme called TWO-VBBS [60] that combines two algorithms, network partitioning [61] and VBBS [59]. Fleury et al. [38], Leonardi et al. [55] and Lysne [63] present methods to avoid deadlock and control congestion in wormhole-routing networks. Wormhole routing technology, with its low communication latency, was presented in [31, 72]. There are several irregular switch-based networks, namely Myrinet [40] and ServerNet [45], that employ wormhole routing. Patarasuk et al. [73] and Lazzez et al. [54] focus on efficient broadcast techniques in networks. Faraj et al. [37] design a message scheduling strategy for switch-based environments, and Salinger et al. [79] consider one-to-all message broadcasting in an interconnection network. The studies [69, 77] address various network applications using wormhole routing. Wang et al. [89-91] presented tree-based and multicast-based algorithms for wormhole routing. Much research focuses on quick broadcasting [97] in wormhole routing: Zhuang et al. [99] proposed a recursion-based broadcast paradigm, Xiang et al. [92] designed unicast-based fault-tolerant multicasting, and Yang et al. [94] presented services-centric multicast. McKinley et al. [66, 67] used a binomial tree to implement contention-free multicast. The communication schemes of the CCO algorithm were proposed in [51]. A lot of research [39, 75 and 82] has used up/down routing as the basic routing in order to reduce link contention. Kesavan et al. [52] present a link contention-free binomial tree constructed with a partially ordered chain (POC) to order the switches in the network. Libeskind-Hadas et al. [57] constructed a contention-free ordered chain for depth contention-free routing on switch-based HNOW.

2.3 Research Architecture

On HNOW, the broadcast operation can be regarded as the composition of a number of P2P communications. Therefore, to estimate the cost of broadcasting a message on HNOW, the cost of each individual P2P communication needs to be formulated first. Then, the overall cost of the broadcast can be evaluated by merging these individual costs. In general, a P2P communication consists of three costs: the message sending cost at the sender, the message transmission cost through the underlying network, and the message receiving cost at the receiver. The cost of a single P2P transmission on HNOW is formulated as follows:

Tptp = Osend + Otrans + Oreceive (2.1)

Osend = Sc^Sender + Sm^Sender × m (2.2)

Otrans = Xc + Xm × m (2.3)

Oreceive = Rc^Receiver + Rm^Receiver × m (2.4)

The transferred message size is m. Osend is the message sending cost at the sender, in which Sc^Sender is the startup cost for the sender and Sm^Sender denotes the message sending latency per byte. Otrans is the message transmission cost, in which Xc represents the constant cost of the message passing through the input and output switch ports and Xm is the per-byte latency of the message passing through the switch. Oreceive is the message receiving cost at the receiver, in which Rc^Receiver is the startup cost for the receiver and Rm^Receiver is the message receiving latency per byte. The factors Sc^Sender, Xc and Rc^Receiver are the constant costs of the sender, the transmission and the receiver, respectively; the factors Sm^Sender, Xm and Rm^Receiver depend on the transferred message size.

Considering the above P2P communication performed on HNOW with multiple switches, formulation (2.3) becomes inapplicable because the network latencies incurred by several switches are not properly considered. To estimate the precise cost of a P2P communication, the message transmission cost given in equation (2.3) can be modeled as:

Otrans = Xtrans × m × d + (Xswitch × m + 2 × Xport) × (d + 1) (2.5)


The parameter d is the link length from the sender to the receiver. Xtrans is the latency of the message transmitting on a link between two switches, Xswitch is the latency of transmitting through a switch and Xport is the latency of passing through the switch port.

In equation (2.5), the transmission overhead consists of two parts: the latency of transmitting between switches (the link cost) and the latency of transmitting through switches (the switch cost). For example, Figure 2.2 shows the graph representation of the switch-based HNOW from Figure 2.1. A workstation connected to S1, acting as the sender, sends the message to the receiver connected to S6. There are many paths from the sender to the receiver; S1-S2-S4-S6 is chosen for discussing the one-way cost of a single P2P message.


Figure 2.2: Graph model for switch based HNOW.

In Figure 2.3, the sender and receiver each have a different speed. The sender cost, term (2), is the latency before the message enters S1, and the receiver cost, term (4), is the latency after it passes through S6. The link length of this path is three, so the link cost is Xtrans × m × 3 and the switch cost is (Xswitch × m + 2 × Xport) × (3 + 1). The transmission cost, term (5), is the combination of the link cost and the switch cost. Hence, the overheads of the sender, the receiver and the transmission are integrated to obtain the one-way cost of a single P2P message, term (1).


Figure 2.3: Communication model of peer-to-peer message passing; (1), (2), (4) and (5) correspond to equations (2.1), (2.2), (2.4) and (2.5).
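The cost model of equations (2.1) through (2.5) can be sketched in a few lines. The parameter values below are invented for illustration (the dissertation does not supply numeric constants here); the path length d = 3 follows the S1-S2-S4-S6 example:

```python
# Sketch of the P2P cost model of equations (2.1)-(2.5).
# All parameter values are illustrative, not measurements.

def p2p_cost(m, d, Sc, Sm, Rc, Rm, Xtrans, Xswitch, Xport):
    """One-way cost of a single P2P message of m bytes over d links."""
    O_send = Sc + Sm * m                                             # eq. (2.2)
    O_trans = Xtrans * m * d + (Xswitch * m + 2 * Xport) * (d + 1)   # eq. (2.5)
    O_receive = Rc + Rm * m                                          # eq. (2.4)
    return O_send + O_trans + O_receive                              # eq. (2.1)

# Path S1 -> S2 -> S4 -> S6 from Figure 2.3: link length d = 3.
cost = p2p_cost(m=2048, d=3, Sc=5.0, Sm=0.01, Rc=5.0, Rm=0.01,
                Xtrans=0.002, Xswitch=0.001, Xport=0.5)
print(round(cost, 3))
```

The (d + 1) factor reflects that a message crossing d links passes through d + 1 switches, as in equation (2.5).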

2.4 Location Aware Broadcast Scheme (LABS)

The Location Aware Broadcast Scheme (LABS) is mainly adapted to irregular and heterogeneous networks. LABS avoids communication collisions while using the shortest routing path. The method focuses on constructing the spanning tree and improves message broadcasting in highly heterogeneous environments.

Figure 2.4 displays the LABS flowchart.


Figure 2.4: Workflow indication of the LABS method.


The LABS has two major phases in scheduling a broadcast on switch-based HNOW.

- Location Oriented Spanning Tree (LST) Construction

This phase constructs LST for each switch that is connected to the source workstation.

- Scheduling Contention-Free Broadcast, which consists of three steps:

- Step 1: Generating postorder list of workstations from the LST.

- Step 2: Optional adjustment of the postorder list obtained in step 1.

- Step 3: Constructing the scheduling tree (SCT) according to the postorder list.

2.4.1 Location Oriented Spanning Tree (LST)

Every switch on a switch-based HNOW had its own spanning tree in this construction.

Because the connections of each workstation were different, some switches had large degrees and some had small ones. The degree in this section represents the number of neighboring switches. If a single spanning tree is constructed by BFS, only the workstations connected to the switches with large degrees receive the messages with low communication latency. The workstations connected to the switches with small degrees, however, may suffer more communication latency in receiving the messages. To solve this problem, the location oriented spanning tree (LST) was designed.

The basic idea of constructing the LST is that every switch chooses its neighboring switches as child nodes, and the number of child switches is restricted. In the LST, each switch picks a limited number of child switches from its neighbors, which means that every level of the LST has a restricted number of switches. The LST thus keeps a restricted width on all levels of the spanning tree, and Lemma 2.1 proves that the LST also has a restricted depth. Because of the restricted width and depth, broadcasting based on the LST is better load-balanced than broadcasting on a BFS spanning tree. Consequently, it is efficient for each switch to have its own spanning tree for broadcasting.


In constructing the LST, the switch connected to the workstation that broadcasts the message to the others is the root. In the second step, the child switches of the root were determined. The number of the root's child switches was ⎡log2s⎤, where s was the total number of switches. The root switch chose its children according to the priority principles for child switches. The purpose of the four judgments was to place the switches with larger degrees at as high a level of the tree as possible. The four judgments for selecting child switches ran from 1 (highest priority) to 4 (lowest priority).

Judgment 1 had two options: one was that it would pick the switches with the maximum degree as child nodes, and the other was that it would pick the switch with the minimum degree as the last child node. Judgment 1 not only enabled the switches with larger degrees to occupy higher tree levels but also avoided picking the switches with lower degrees early.

Judgment 2 had two options: one was that it would pick the switches connected to the smallest number of already-picked neighbors for the child nodes before the last one; the other was that it would pick the switch connected to the largest number of already-picked neighbors as the last child node. Judgment 2 gave the switches of the LST a better chance of finding their own child switches. Judgment 3 picked the switch connected to the maximum number of workstations. The last judgment picked the switch with the smallest id as the child switch.

Figure 2.5 demonstrates the paradigm of an LST. The number of the root's child switches was ⎡log2s⎤, and the child furthest to the left was A. A's number of child switches was ⎡log2s⎤ − 1, and each right sibling had one less. Therefore, node B's number of child switches was ⎡log2s⎤ − 2, node C's was ⎡log2s⎤ − 3, and so on. As a result, the child of the root furthest to the right, L, had zero child switches. The lower levels of the LST followed the same rule in constructing their sub-trees.

Figure 2.5: Paradigm of an LST.
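The width rule above (a node with quota q gets q children whose quotas are q − 1, q − 2, …, 0) can be sketched as follows. The helper is a hypothetical illustration: it only counts how many switches each LST level holds when every quota is fully met.

```python
import math

def build_quota_tree(s):
    """Count the switches per LST level under the quota rule: the root has
    quota ceil(log2 s); a node with quota q gets q children whose quotas
    are q-1, q-2, ..., 0 (a binomial-tree shape). Illustrative sketch only."""
    def expand(quota, depth, levels):
        if len(levels) <= depth:
            levels.append(0)
        levels[depth] += 1          # this node occupies its level
        for child_quota in range(quota - 1, -1, -1):
            expand(child_quota, depth + 1, levels)
        return levels

    root_quota = math.ceil(math.log2(s))
    return expand(root_quota, 0, [])
```

For s = 8 the level widths are [1, 3, 3, 1]: eight switches arranged in ⎡log2 8⎤ + 1 = 4 levels, which matches the depth bound of Lemma 2.1.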

In the third step, each child switch was made a root in turn, from left to right. Every new root switch performed the above four judgments to pick its child switches until a complete LST was constructed. If the degree of a root switch was smaller than its restricted number of child switches, the unused quota was passed to the next root switch. For example, when the restricted number of child switches of a root switch was three but its degree was only two, the restricted number of child switches of the next root switch was changed from two to three. In the last step, it was checked whether all switches were in the LST. If any switch was missing, three judgments were used to choose its father switch. All switches connected to the missing switch were searched first. Judgment 1 chose the switch with the largest remaining quota of child switches. Judgment 2 took the switch with the minimum number of workstations and child switches, and judgment 3 picked the switch on the upper tree level of the LST as the father switch. The reasons for these judgments were to ensure that the missing switch connects to a neighboring switch, and to keep the depth and width of the LST as restricted as possible. After all these judgments for picking the child or father switches, the LST for the root switch connected to the source workstation was constructed. Algorithm 2.1 demonstrates how to construct the LST.

Algorithm 2.1 // Location Oriented Spanning Tree

S = {S1, S2, S3, …, SN}, where Si is a switch and N is the number of switches on a given switch-based HNOW.

C = {Ci1, Ci2, Ci3, …, Cij}, where Cij is a child switch of switch Si and j is the number of child switches of switch Si.

| Ci | = the restricted number of child switches of switch Si.

| RCi | = the real number of child switches after Si picks its child switches.

Dmax = max {ds1, ds2, …, dsN}, where dsi is the degree of switch Si. Dmin = min {ds1, ds2, …, dsN}.

| SDmax | = the number of switches whose degree is equal to Dmax.

| SDmin | = the number of switches whose degree is equal to Dmin.

| NSsi | = the number of Si's neighboring switches that have already been picked.

W = {ws1, ws2, ws3, …, wsN}, where wsi is the set of workstations of switch Si and | wsi | is the number of workstations of switch Si.

Wmax = max {| ws1 |, | ws2 |, …, | wsN |}.

| SWmax | = the number of switches whose number of workstations is equal to Wmax.

Step 1. Let Si be the root switch of the LST; it has x immediate child switches Ci1, Ci2, …, Cix-1, Cix.

(a) Find x − 1 switches as Si's child switches (Ci1, Ci2, …, Cix-1), according to the following criteria:

1. Dmax; // the largest degree has the highest priority
   If (| SDmax | > 1) then compare | NSsi |,
2. | NSsi |; // the smallest | NSsi | has the highest priority
   If (2 switches have the same value) then compare Wmax,
3. Wmax; // the maximum number of workstations has the highest priority
   If (| SWmax | > 1) then compare the id numbers of the switches,
4. Switch id; // the smallest id has the highest priority

(b) Find one switch as Si's rightmost child switch (Cix), according to the following criteria:

1. Dmin; // the smallest degree has the highest priority
   If (| SDmin | > 1) then compare | NSsi |,
2. | NSsi |; // the largest | NSsi | has the highest priority
   If (2 switches have the same value) then compare Wmax,
3. Wmax; // the maximum number of workstations has the highest priority
   If (| SWmax | > 1) then compare the id numbers of the switches,
4. Switch id; // the smallest id has the highest priority

Step 2. Repeat Step 1 to construct the remaining sub-trees of the LST.

Step 3. Check whether all switches are in the LST.

If (Si is a switch missed from the LST) then find one switch as Si's father switch Sj, according to the following criteria:

1. | Cj | − | RCj |; // the maximum value has the highest priority
   If (2 switches have the same value) then compare | RCj | + | wsj |,
2. | RCj | + | wsj |; // the minimum value has the highest priority
   If (2 switches have the same value) then compare the level of the LST,
3. The level of the LST; // the upper level has the highest priority
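The tie-breaking in Step 1 amounts to a multi-criteria sort over candidate switches. Below is a sketch of that selection; the dictionary-based inputs (degree, picked_neighbors, workstations) are assumed stand-ins for the per-switch attributes of Algorithm 2.1, and the restriction of candidates to actual neighbors is omitted for brevity.

```python
def pick_children(candidates, quota, degree, picked_neighbors, workstations):
    """Sketch of Algorithm 2.1, Step 1 (hypothetical dict-based inputs).

    candidates: switch ids not yet in the LST
    degree[s], picked_neighbors[s], workstations[s]: per-switch attributes
    """
    if not candidates or quota <= 0:
        return []
    # (a) first quota-1 children: largest degree, then fewest picked
    #     neighbors, then most workstations, then smallest id
    firsts = sorted(candidates,
                    key=lambda s: (-degree[s], picked_neighbors[s],
                                   -workstations[s], s))[:max(quota - 1, 0)]
    rest = [s for s in candidates if s not in firsts]
    # (b) rightmost child: smallest degree, then most picked neighbors,
    #     then most workstations, then smallest id
    last = sorted(rest,
                  key=lambda s: (degree[s], -picked_neighbors[s],
                                 -workstations[s], s))[:1]
    return firsts + last
```

Tuple keys encode the priority order directly: earlier tuple positions correspond to higher-priority judgments, and negation flips a criterion from ascending to descending.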

Step 1 shows the four judgments for choosing child switches, and the number of child switches is obtained from the father switch or the switch to the left. Step 3 checks whether any switch was not picked and finds a father switch for the missing switch in the LST. The algorithm continues until the complete LST is constructed. Lemma 2.1 states that the depth of an LST is limited by ⎡log2s⎤ + 1, where s is the number of switches.

Lemma 2.1: The maximum depth of an LST is ⎡log2s⎤ + 1, where s is the number of switches.

Proof: Let L1, L2, …, LD be the numbers of switches at each level of an LST, where D is the depth of the LST. Since the root has ⎡log2s⎤ child switches and the quotas decrease by one from left to right, the LST is a binomial-like tree, and we have

L1 + L2 + L3 + … + LD = 2^⎡log2s⎤

Because L1 = 1 and L2 = ⎡log2s⎤,

L3 + … + LD = 2^⎡log2s⎤ − 1 − ⎡log2s⎤ ………(2.6)

According to the construction rule, the number of switches at level i is a binomial coefficient of L2:

L3 = L2(L2 − 1) / 2, L4 = L2(L2 − 1)(L2 − 2) / 6, …, Li = C(L2, i − 1), …, LD = C(L2, D − 1),

so that

L3 + L4 + … + LD = Σi=3..D C(L2, i − 1) ………(2.7)

which agrees with (2.6) when D = L2 + 1. To prove ⎡log2s⎤ ≥ D − 1, suppose to the contrary that D > ⎡log2s⎤ + 1, that is, D ≥ L2 + 2. Then the number of switches at level D would be LD = C(L2, D − 1) with D − 1 ≥ L2 + 1, which equals zero. A level of the LST containing no switch contradicts the assumption that the depth of the LST is D. Hence, we get ⎡log2s⎤ + 1 ≥ D. ∎

Below is an example demonstrating this mechanism. Assume that a workstation connected to S1 is the source broadcasting the message, as shown in Figure 2.6(a). S1 is the root of the LST, and its number of child switches is ⎡log28⎤ = 3. Based on the four judgments of Algorithm 2.1, the child switches of S1 are S7, S3 and S8, whose restricted numbers of child switches are 2, 1 and 0, respectively. S7 and S3 choose their child switches by the judgments in turn. The child switches of S7 are S5 and S6; the child switch of S3 is S2. Finally, S5 picks S4 as its child switch. The complete switch-based LST rooted at S1 is shown in Figure 2.6(b).

Figure 2.6: The switch-based graph example. (a) Graph representation. (b) The switch-based LST rooted by S1.


2.4.2 Contention-Free Broadcast

The performance of broadcasting is affected by link contention. Many studies have used up*/down* routing to reduce link contention and obtain better performance. In this study, three cases in which routing links are contention-free were explored. In the first case, all senders and receivers were connected to the same switch, as shown in Figure 2.7(a). In the second, all senders were connected to the same switch and the receivers to other switches, as shown in Figure 2.7(b); the senders can send the messages to the receivers at the same time without link contention. In the last case, two messages were sent from different sender switches to different receiver switches, as shown in Figure 2.7(c). Even though both messages passed through switch 1 at the same time, they did not cause any link contention in switch 1.

Figure 2.7: Three contention-free link examples.
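The three cases share one property: two simultaneous transmissions are contention-free as long as their routing paths share no link, even if they pass through the same switch. A minimal check, assuming paths are represented as lists of switch or workstation ids:

```python
def contention_free(path_a, path_b):
    """True iff two simultaneous messages share no directed link;
    sharing only a switch (as in Figure 2.7(c)) causes no contention."""
    links_a = set(zip(path_a, path_a[1:]))   # consecutive pairs = links
    links_b = set(zip(path_b, path_b[1:]))
    return links_a.isdisjoint(links_b)
```

For example, contention_free(["s2", "s1", "s3"], ["s4", "s1", "s5"]) holds: both paths traverse switch s1 but on different links.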

After constructing the LST, some information was obtained from it for the construction of the scheduling tree. For instance, the postorder list was obtained from the LST. Also, the workstation with the fastest speed at each switch could be determined as a group head, and the workstations could then be grouped accordingly.

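The postorder list can be obtained with an ordinary postorder traversal of the LST. A sketch, assuming the LST is given as a child-list dictionary; applied to the LST of Figure 2.6(b), it yields the ordered chain S4, S5, S6, S7, S2, S3, S8, S1.

```python
def postorder_list(tree, root):
    """Postorder traversal of an LST given as {node: [children]};
    the resulting ordered chain is later used to build the SCT."""
    out = []
    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        out.append(node)
    visit(root)
    return out
```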

The first step in the construction of a scheduling tree is acquiring the postorder list of the workstations from the LST. Message routing along an ordered chain does not cause contention between links used simultaneously by different workstations. The postorder list of workstations from the LST is such an ordered chain, and it is used to construct the scheduling tree with contention-free links. In the second step, the fastest workstation of each switch is taken as the group head. The members of a group are the workstations connected to the same switch. Every group has a sub-tree group size, which is the number of workstations in its sub-tree. In the third step, these sub-tree groups were adjusted according to their group sizes, decreasing from left to right. The adjusted list placed the large groups in the upper levels of the scheduling tree and brought the width and depth of the scheduling tree closer to a binomial-like scheduling tree. After adjusting the postorder list of workstations, the first level of the scheduling tree was constructed with all group heads using binomial-like scheduling. Then each group constructed its own scheduling tree, i.e. the second-level scheduling tree, using binomial-like scheduling, until the construction of the scheduling tree was completed. In the last step, the fast workstations were arranged to be the father workstations of the slow ones; these fast workstations were then responsible for sending the messages to the slower workstations. This made broadcasting over the scheduling tree more efficient. The links on the first level of the scheduling tree were still contention-free even if the postorder list was adjusted: because the workstations on the first level were all connected to different switches, no link contention could arise between any two workstations sending messages at the same time.
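The binomial-like scheduling used for the group heads (and again inside each group) can be sketched as follows: in every round, each workstation that already holds the message forwards it to one that does not, so the number of informed workstations doubles per round. This is an assumed uniform-cost sketch; it ignores the heterogeneous workstation speeds that the actual scheme exploits when ordering the list.

```python
def binomial_schedule(nodes):
    """Binomial-like broadcast rounds over an ordered node list:
    each round, every informed node sends to one uninformed node."""
    informed = [nodes[0]]
    rounds = []
    i = 1
    while i < len(nodes):
        sends = []
        for src in list(informed):
            if i >= len(nodes):
                break
            dst = nodes[i]
            sends.append((src, dst))
            informed.append(dst)
            i += 1
        rounds.append(sends)
    return rounds
```

For eight nodes the schedule takes three rounds, e.g. round 1: (0→1); round 2: (0→2), (1→3); round 3: (0→4), (1→5), (2→6), (3→7).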

The links of the second level of the scheduling tree were also contention-free because the workstations in the same group were connected to the same switch, as in Figure 2.7(a). The workstations in this scheduling tree had only two kinds of connections: to workstations connected to neighboring switches, or to workstations connected to the same switch.
