Concentrations, Load Balancing, Multicasting and Partial Permutation Routing on Hypercube Parallel Computers

(1)

693

Concentrations, Load Balancing, Multicasting and Partial

Permutation Routing on Hypercube Parallel Computers

GENEEUJAN,FRANKYEONG-SUNGLIN*_,_M_ING-_B_O_L_IN** ANDDERONLIANG

Department of Computer Science National Taiwan Ocean University

Keelung, 202 Taiwan

E-mail: {B0199, drliang}@mail.ntou.edu.tw *_{Department of Information Management}

National Taiwan University Taipei, 106 Taiwan

**_{Department of Electronic Engineering} National Taiwan University of Science and Technology

Taipei, 106 Taiwan E-mail: [email protected]

Some basic algorithms on hypercube interconnection networks are addressed and then applied to concentration, superconcentration, multicasting, partial permutation routing, and load balancing problems in this paper. The results show that both concentra-tion and superconcentraconcentra-tion problems can be solved in O(n) time and the multicasting and partial permutation routing problems in O(n2_{) time with O(1) buffers for each node,} where n is the dimension of hypercube interconnection networks. The load balancing problem based on superconcentration can be solved in O(Mn) time, where M is the maximum number of tasks in each node.

Keywords: concentration, hypercubes, interconnection networks, load balancing, multi-casting, partial permutation routing, permutation networks, superconcentration

1. INTRODUCTION

An n-dimensional hypercube, denoted byHn, has N = 2nprocessors. Each processor

inHn can be labeled with an n-bit binary number. An example of a three-dimensional,

eight-processor hypercube is shown in Fig. 1. Two processors are called adjacent and connected by a link or communication channel if their labels differ in only one bit posi-tion [22]. An N-processor hypercube has log2 N diameter and N log2N communication links since every processor is connected to its log2N neighbors. A hypercube system is a network of computers in which processors are located at the nodes of the hypercube and communication channels between processors are the links of the hypercube. Every proc-essor is a RAM (random access machine) with some local memory and can perform any of the basic operations such as addition, subtraction, etc., in one unit of time. These unique topological properties make it a strong candidate for multiprocessor systems. In fact, several commercialy available multiprocessor systems are based on the hypercube architecture.

Received September 3, 2001; accepted April 15, 2002.

(2)

Fig. 1. A 3-dimensional, 8-processor hypercube.

Hypercube systems can be classified into two major versions, depending on how many neighbors one processor can communicate with in one unit of time. In the first ver-sion, known as the sequential hypercube or single-port hypercube, it is assumed that a processor can communicate with only one of its neighbors in one unit of time. In contrast, the second version, known as the parallel or multiport hypercube, assumes that a proces-sor can communicate with all of its n neighbors in one unit of time. In this paper, we as-sume that synchronous single-port hypercube systems are used.

The packet-routing problem on any interconnection network is of great important once. This problem involves how to transfer the right data to the right place within a rea-sonable amount of time [14]. To measure the routing capability of an interconnection network, the PPR (partial permutation routing) problem is usually used as the metric. In the PPR problem, each processor is the origin of at most one packet and each processor is the destination of no more than one packet [8].

The data concentration problem, in general, has two types: concentration and su-perconcentration [4-6, 17, 21]. The former involves mapping k packets from any k essors into the first k processors without the capability to distinguish among those proc-essors, while the latter concerns how to map k packets from any k processors into some k speific processors. The prefix-sum problem involves computing the sums ,

0

∑

=

= k_i i

k s

p

for all k between 0 and N− 1, given N numbers s0, s1, …, sN-1.

In this paper, we propose a new deterministic PPR algorithm based on concentration for hyperube interconnection networks. The essence of this algorithm is based on the repeated concentration applications to separate the input packets into two groups: one with bk= 0 and the other with bk= 1, where bkis the kth bit of the destination addresses

of input packets and k runs from n − 1 to 0. That is, the input packets are sorted in a manner like binary MSB (most significant bit) radix sorting. Hence, they will be routed to their proper destinations after n steps. The routing of these packets is carried out by a sorting operation on the “fly.” The results show that this algorithm can solve any PPR problem in hypercube interconnection networks in the time complexity of O(n2) with only three buffers on each node.

A multicast routing algorithm which can realize all N-element one-to-many assign-ments in polylogarithmic routing time using additional multicasting tags and duplication codes is addressed as well.

We also apply the superconcentration algorithms to the load balancing problem in hypercube interconnection networks. The result shows that it can be solved in O(Mn)

(3)

time, where n is the dimension of hypercube interconnection networks and M is the maximum number of tasks in each node.

The rest of the paper is organized as follows. Section 2 describes a general routing paradigm that will be used throughout the paper. Section 3 applies the general routing paradigm to both concentration and superconcentration problems. Section 4 considers the PPR problem. Section 5 addresses the multicst routing problem. Section 6 demonstrates how the load balancing problem can be solved using the superconcentration concept and broadcast and data reduction algorithms developed in the previous section. Section 7 concludes this paper.

2. GENERAL ROUTING PARADIGM

In this section, a general routing paradigm [10, 13] that will be used throughout the paper for solving the PPR problem on hypercube interconnection networks is described. 2.1 General Idea

The basic concept of the general routing paradigm is based on the observation that any routing operation, regardless of static or dynamic interconnection networks are used, can be cast into a sorting operation followed by a monotone routing operation. The for-mer is to generate a monotone sequence according to the destination addresses of input packets; the latter is to route the packets according to the monotone sequence generated to their destinations without blocking. Of course, these two operations can be combined into one if the sorting operation is modified appropriately. In the following, we will detail how to do this.

Based on the above idea, one way to route input packets to their destinations on an (N, N) routing network, without blocking, is to separate the inputs into two groups ac-cording to their kth-bit values recursively, where k is runs from n− 1 to 0, and n = log2N. More specifically, the destination addresses of input packets are sorted by using binary radix sort.

To separate the inputs into two groups so that the kth bit of the packets are 0 and 1, an ( ₁, ₁) 0/1

2 2n−k− n−k− −

N

N _{separator is used. As shown in Fig. 2, if log}

2Nstages of 0/1 spa-rators, with each containing ₂k+1(₂n−k−1, ₂n−k−1)−0/1

N N

N _{separators, are cascaded, the}

re-sulting is a general routing network. Each (₂n−k−1, ₂n−k−1)−0/1

N

N _{separator receives}

1 2n−k−

N _{packets from its inputs and outputs those packets with a kth bit of 0 to the upper}

half outlets and 1 to the lower half outlets. That is, it groups the input packets into two subgroups by using the status of the kth bit of the destination addresses of input packets. By repeating the above operation n times, all packets will be properly routed to their des-tinations. d

The above procedure is an application of binary radix sort. However, the problem left is how to construct an (₂n−k−1, ₂n−k−1)−0/1

N

N _{separator so that it can perform}

the correct operation that sorts the destination addresses on the basis of each binary digit of the addresses. The following section will detail the construction and properties of the

1 / 0 ) , ( ₁ ₁ 2 2n−k− n−k− − N N _separator.

(4)

Fig. 2. The general routing paradigm for PPR problem.

2.2 0/1 Separator

As described previously, the main function of the ( ₁, ₁) 0/1 2

2n−k− n−k− −

N

N _{separator is}

to classify input packets into two groups based on the values of the kth bit of their desti-nation addresses. The basic structure of an (₂n−k−1, ₂n−k−1)−0/1

N N _{separator is composed} of an ( ₁, ₁) 2 2 2n−k− n−k− N

N _{distributor and two} ₍ _, ₎

2 2nk 1 nk N N − −

− concentrators, as shown in Fig.

3. The function of the( 1, ₂ 1) 2 2n−k− n−k−

N

N _{distributor is to divide the input packets into two}

subgroups with each of them containing a kth bit of ‘0’ and ‘1’. This circuit can be easily done with 1-to-2 demultiplexers controlled by the kth bit, as shown in Fig. 3. However, it doubles the “network bandwidth.” To recover the original network bandwidth, two

) , ( 2 2nk1 nk N N − −

− concentrators are used, one for each subgroup.

Each ( , ) 2 2nk 1 nk N N − − − concentrator is composed of (₂n−k−1, ₂n−k−1) N N _{prefix sum} network, ( ₁, ₁) 2 2n−k− n−k− N

N _{parallel subtracter, and} ₍ _, ₎

2 2nk1 nk N N − −

− indirect binary cub

network, as shown in Fig. 3. Each (₂n−k−1, ₂n−k−1)

N

N _{prefix sum network accepts at most}

1 2n−k−

N _{packets, computes the outlet rank of each packet, and then output packets along}

with their ranks. In order to remove the redundant rank outputs in the outlets in which no packets appeared in their corresponding inputs of the prefix sum network,

) , ( ₁ ₁ 2 2n−k− n−k− N

N _{parallel subtracter is used to adjust the rank of each packet to be started}

from an appropriate value determined by which iteration being processed, and delete the rank values of the outlets in which no packet received in their corresponding inputs so that the output of parallel subtracter is a monotone rank. The rank then serves as the routing tag for each packet to route itself to the outlet of the ( ₁, ₁)

2 2n−k− n−k− N N _{− 0/1} separa-tor through (₂n−k−1, ₂n−k−1) N

(5)

Fig. 3. The general structure of an (N, N)− 0/1 seperator.

that (₂n−k−1, ₂n−k−1)

N

N _{− 0/1 separator only contains}

1 2n−k− N _{outlets, each} ₍ _, ₎ 1 1 ₂ 2n−k− n−k− N N

indirect binary cube network indeed requires only nk

N

−

2 outlets. However, it is clearer to keep the entire network intact.

2.3 Prefix Sum Network

Although several approaches can be used to carry out the prefix rank computation required for each packet to be routed through the indirect binary cube network, the fol-lowing approach is used in the general paradigm for our convenience. In the network shown in Fig. 4, each prefix cell carries out the following computations:

i in j in j out j in i in i out i in i out s p p s s s p p + = + = = i out j in i in j out s s s s = + = (1)

After log2Nsteps, the prefix sum pouti appears in the outlets of the network.

2.4 Indirect Binary Cube Networks

An (N, N)-indirect binary cube network, where N is a power of 2, is recursively formed by cascading two (N/2, N/2)-cube networks with a stage ofN/2 (2, 2)-switches such that one inlet of each switch is connected to one (N/2, N/2)-cube network, and its other inlet is connected to the other (N/2, N/2)-cube network. When the (N/2, N/2)-indirect binary cube networks are recursively decomposed, an (N,N)-cube network

(6)

Fig. 4. An (8, 8)-prefix sum network example.

consists of logN stages each withN/2 (2, 2)-switches. Fig. 5 depicts a recursively de-composed 8-inlet cube network. A detailed description of cube networks can be found in [11, 20, 25].

Fig. 5. An 8-inlet indirect binary cube network.

2.5 Cost and Performance

To calculate the cost and time complexities of an (N,N) general routing paradigm shown in Fig. 2, it is necessary to detail the running time of each step. As shown in Fig. 2,

(7)

the general routing paradigm consists of log2N0/1 separator stages, with each 0/1 sepa-rator being composed of four subcircuits. These are (N,N) distributor, (N,N) prefix sum network, (N,N) parallel subtracter, and (N,N) indirect binary cube networks. An (N,N) distributor and (N,N) parallel subtracter cost one unit time; an (N,N) prefix sum network and (N,N) indirect binary cube network cost log2 Nunit times. Therefore, the running time of the (N,N) general routing paradigm is O(log22N).

3. CONCENTRATION AND SUPERCONCENTRATION

A concentration problem is one that involves how to route k packets from any k (source) processors to some fixed k consecutive destination processors on an intercon-nection network without the ability to distinguish among those destinations. A supercon-centration problem is concerned with how to route k packets from any k processors to any k specified processors without the capability to distinguish among those destinations. The main difference between concentration and superconcentration problems is that the input packet sets of the latter can be mapped into any subset of destinations on a one-to-one basis, while the input packet sets of the former can only be mapped into some fixed, but not any arbitrary subset of destinations.

3.1 Concentration

In general, the operation of a concentration algorithm can be divided into two steps: prefix-sum and routing. The former determines the destination node of each packet while the latter routes packets to their destinations. On the basis of this paradigm, the concen-tration operation in hypercube interconnection networks can be carried out in two phases. In the first phase, k packets from k processors are ranked by thePrefix_Sumalgorithm, which can be done in O(n) time. In the second phase, the ranks of the k packets are used to route the k packets to k consecutive processors starting from processor PE0. By using

theIndirect_Binary_Cube_Routing, any k packets can be routed from their processors to the k consecutive processors starting from the topmost processor in O(n) time, where k ≤ 2n

= N.

The detailed algorithm is described as follows. Algorithm: Concentration(d, pid, s)

Begin

{All operations are carried out in processor PEi+pidand use Bi+pid(0) as buffer. Begin

Step 1: for all PEi+pid(0≤ i < 2d) do in parallel {Prefix ranking tree.} 1.1: for k = 0 to d− 1 do

1.1.1: if i + pid <(i+ pid)k then do {Compute (2, 2)-prefix sum.}

1.1.1.1: ;

) ( )

(i pid pi pid si pid

p ₊ k = ₊ k + +

1.1.1.2: ;

) ( )

(i pid i pid i pid pid

i s s s

s+ = ₊ k = ₊ k + +

end if

Step 2: for all PEi + pid(0≤ i < 2d) do in parallel {Parallel subtracter.}

(8)

then rri+pid= pi+pid+ pid− 1; else rri+pid= pi+pid+ (pid + 2d-1)− 1. end for

Step 3: for all PEi + pid(0≤ i < 2d) do in parallel {Indirect binary cube network.} for m = 0 to d− 1

3.1: if bm= 0 or rri+pid= 0 then {Perform the actual routing operation.} forward the packet in Bi+pid(0) to thePE(i+pid)(bm=0)(upper outlet.) else forward the packet in Bi+pid(0) to thePE(i+pid)(bm=1)(lower outlet.) end for

End {of Concentration}

It is easy to see that the above concentration algorithm can route k packets from any processor to the topmost k processors in O(n) time.

3.2 Superconcentration

A superconcentration algorithm on hypercube interconnection networks can be ob-tained by performing the concentration algorithm two times and thus operates in three phases. In phase one, the k packets from any k processors are concentrated to the topmost k processors. In phase two, the k destination addresses are used as dummy packets and are concentrated in the topmost k processors also. The outputs in both phases must meet in a one-to-one basis exactly in the same k processors because the same number of pack-ets are concentrated from both phases. Phase three involves routing the k packpack-ets to their destinations using the known routing information obtained in Step 2.

The detailed superconcentration algorithm is described as follows. Algorithm: Superconcentration_on_Hypercube(n)

Begin

Step 1: {The k packets from any k processors are concentrated in the topmost k processors.}

1.1: for all PEi(0≤ i <k/n) do in parallel

1.1.1: move Bi(1) and Bi(0); {Initial values are in Bi,j(1).} 1.2: call Concentration(d, pid, 0);

Step 2: {Set the k destination addresses as dummy packets to be concentrated.} 2.1: for all PEi(0≤ i <k/n) do in parallel

2.1.1: save their destination addresses in Bi(1) and Bi(0); 2.2: call Concentration(d, pid, 0);

{The k dummy packets are also concentrated in the topmost k processors.} 2.3: for all PEi(0≤ i <k/n) do in parallel

2.3.1: transfer these routing tags set by the k dummy packets to the k dummy packets in a one-to-one basis;

Step 3: {Route the k packets to their destinations through the indirect binary cube by using the known routing tags obtained in Step 2.}

End {of Superconcentration_on_Hypercube}

It is easy to see that the above superconcentration algorithm can superconcentrate any pattern of inputs to any equivalent number of outputs in O(n) time.

(9)

4. PARTIAL PERMUTATION ROUTING

Now we apply the concentration algorithm described in the previous section to the PPR problem in hypercube interconnection networks. The essence of the PPR algorithm is based on the following observation. If we concentrate the input packets according to their destination addresses in a binary-radix manner from MSB to LSB (least significant bit) then these packets will reach their destinations after n iterations. As a consequence, two steps are required in each iteration: One for processing packets with bk= 0, and the other for packets with bk= 1. More precisely, for each iteration k, where 0≤ k ≤ n − 1, two phases are required for processing the input packets stored in buffers Bi(1), where 0 ≤ i ≤ 2n_{− 1. The input packets being processed are stored in buffers Bi}

(0). The detailed algorithm is described as follows.

Algorithm: PPR_on_Hypercube(n) {All packets to be routed are stored in Bi(1).} Begin

Step 1: for all PEi0≤ i < 2ddo in parallel 1.1: transfer Bi(1) to Bi(0);

Step 2: for k = n− 1 downto 0 do

2.1: (Phase I) {Processing the packets with bk= 0.} 2.1.1: for all PEido in parallel

if bit bkof the destination address of the packet in Bi(0) is 0 then set (pi, si)(1) = (1, 1);

2.1.2: for all s = 0 to 2n-k-1− 1 parallel call Concentration(k + 1, s× 2k+1, 0);

2.2: (Phase II) {Processing the packets with bk= 1.} 2.2.1: for all PEido in parallel

if bit bkof the destination address of the packet in Bi(0) is 1 then set (pi, si)(1) = (1, 1);

2.2.2: for all s = 0 to 2n-k-1− 1 parallel call Concentration(k + 1, s× 2k+1, 1); End {of PPR_on_Hypercube(n)}

An example of the operations of the general routing paradigm for PPR_on_Hy-percube(8)algorithm is shown in Fig. 6. In phase I, all packets with bk= 0 (k = 2) are processed. In phase II, all packets with bk= 1 (k = 2) are processed. The objective of the algorithm is to separate and route packets with bk= 0 (k = 2) and bk = 1 (k = 2) to the processors in which the identifiers have the same bit values (i.e., bk= 0 and bk= 1). From the figure, it is easy to verify that the general paradigm indeed carries out the radix sort-ing operation based on each binary digit of the destination addresses of input packets.

The running time for the PPR_on_Hypercube(n) algorithm is O(n2) since the Concentration(d, pid) algorithm needs O(n) time, and the time complexity of the outer loop ofPPR_on_Hypercube(n)algorithm is also O(n).

(10)

Fig. 6. An example of PPR problem on the general routing paradigm.

5. MULTICASTING

In this section, we apply the general routing paradigm to the multicasting problem on hypercube networks.

5.1 General Scheme

To multicast a packet, it is necessary to know the destinations in advance. Thus, it is often to assumed that an additional N-bit multicasting tag, shown in Fig. 7, is applied to each packet to determine whether the packet should be duplicated to both halves of the distributor, routed to the upper half of distributor only, routed to the lower half of dis-tributor only, or just discarded at this stage. If an inlet on hypercube networks must spec-ify the outlets to which it is to be mapped independently of other inlets, then it must save these outlet addresses as many as N log2N bits, since the size of each outlet address is

log2N bits and a packet in any inlet might be multicast to as many as N outlets, but

actu-ally, N bits is sufficient to specify the outlets to which it is to be mapped.

Let bN-1… b1b0be a multicasting tag of a packet at an inlet, where bi= 1 if and only if this packet is to be transmitted to outlet i. Using this N-bit multicasting tag, we can generate a (2N − 2)-bit duplication code, u andij ,

j i

l shown in Fig. 8(b), where u =ij

bN/2-1OR … OR b1OR b0and

j i

l = bN-1OR … OR bN/2+1OR bN/2at the stage j = 0, and

j i

u and lij, 0≤ i ≤ 2

j_{− 1, 0 ≤ j ≤ lg N − 1, are the duplication code of the i-th distributor} from top to bottom at the j-th stage from left to right. It is obvious that the packet should be routed to the upper half of the distributor if and only if bN/2-1…b1b0are not all zero, or

equivantly u = 1, and it should be routed to the lower half of the distributor if and onlyij

if bn-1… bN/2+1bN/2are not all zero, or equivantly

j i

l = 1. Repeating this coding of multi-casting tag from the first stage to the last stage, we can generate a (2N− 2)-bit

duplica-tion code, 32 .... 2 3 2 2 2 2 2 1 2 1 2 0 2 0 1 1 1 1 1 0 1 0 0 0 0 0l u l ul u l u l u l u l u

(11)

Fig. 7. An example of multicasting tag.

(a)

(b)

Fig. 8. (a) Multicasting tag indicates that the packet must be duplicated and routed to the outlets 001, 011 and 110, (b) Duplication codes.

5.2 Multicasting on Hypercubes

Each packet now can carry its N-bit multicasting tag and perform the OR operations in each stage through the entire network or preprocessed (2N− 2)-bit duplication code to determine the number of duplicates and reduce the logic operating time in each stage. A complete description of the latter multicasting scheme for a 8-inlet hypercube network is shown in Fig. 9. Packet A at inlet 0 wants to be multicatesd to the outlets 1, 3 and 6. It will generate a broadcast tag as indicated in Fig. 8(a) and a 14-bit duplication code, 11/1101/01010010. The first two bits before the first slash indicate packet A and its re-maining 12-bit duplication code must be duplicated and routed to both halves at stage 0 since its destination outlets belong to both halves. Repeating this duplication scheme from the first stage (stage 0) to the last stage (stage 2j− 1), the routing of one-to-many assignments for the hypercube network in Fig. 9 is straightforward.

(12)

Fig. 9. An example of multicasting on the general routing paradigm.

Theorem 1Using additional multicasting tags and coding schemes, the network in Fig. 9 can multicast any pattern of packets on a self-routing basis.

6. LOAD BALANCING PROBLEM

For our purpose, load balancing can be defined as L tasks distributed over N proc-essors with no more than M tasks assigned to any single processor, where L/N≤ M ≤ L and N = 2n. To make the best use of the system processor resources, the workload differ-ences between any two processors must be at most only one at any time. It is well under-stood that distributed computer systems have the potential to improve performance and resource sharing [7]. However, one major problem for such systems is that it is possible for some processors to be heavily loaded while others are lightly loaded or even idle. To maximize the performance of such systems, it is necessary to keep every processor busy as long as some tasks are waiting for service in the systems. Two distinct strategies have been proposed [18, 23, 24] for this purpose. Load balancing algorithms explore the pos-sibility of equalizing the workload among the processors, while load sharing algorithms simply attempt to assure that no processor is idle while some tasks are waiting for service.

In general, load balancing algorithms require much more resources from the sys-tems than load sharing algorithms [12]. Therefore, the resource requirement may out-weigh the potential benefits in the underlying systems if we do not have good enough load balancing algorithms.

Most recent highly parallel computer systems utilize the wormhole routing schemes [16]. In such systems, the communication overhead is largely on link contention, while the variation due to distance between two processors depends negligible [1]. Hence, the effectiveness of the load balancing problem lies in finding a match between heavily loaded processors and lightly loaded processors so that link contention can be avoided.

(13)

Bokhari [1] proposed a network flow model for load balancing in such systems. However, the resulting time bound is O(N2) for meshes and ( log22 )

2 N N

O for hypercubes.

To improve the performance of load balancing algorithms, many dynamic load bal-ancing strategies have been proposed [19, 26, 28]. These include sender initiated diffu-sion [3], which uses the load information of near-neighbor processor to migrate surplus load from heavily loaded processors to lightly loaded neighbors in the system, receiver initiated diffusion, which is the converse of the sender initiated diffusion, hierarchical balancing method, which is an asynchronous global approach and organizes the system into a hierarchy of subsystems, gradient model [15], which employs a gradient map of the proximities of underloaded processors in the system to guide the tasks migration be-tween overloaded and underloaded processors [26], and dimension exchange method [3], which is a global and fully synchronous approach.

The major problem with the above approaches is that all of them are built on the underlying machine architectures. Thus, the load balancing algorithms increase not only system workloads but also the link contentions. In this paper, we will propose a self-routing superconcentrator-based load balancing approach. Using this scheme, the underlying system can leave the load balancing problem entirely to the superconcentrator. Hence, the performance improvement is twofold. First, the superconcentrator takes care of all the tasks required for load balancing so that it is transparent to the system. Second, the superconcentrator provides all communication paths for task migration required for the load balancing process; hence it does not creat any additional traffic on the link be-tween any two processors of the underlying systems. The only price to be paid is that it requires an O(N) cost superconcentrator, and each computer requires two additional I/O ports, one for input and one for output, which are connected to the superconcentrator.

Load balancing can be considered to be an operation that searches for appropriate pairings among processors that are heavily loaded and those that are lightly loaded. Three issues are intimately related to load balancing. They are: (1) load difference evaluation, that is, how to classify the processors as overloaded or underloaded; (2) mapping be-tween overloaded and underloaded processors; and (3) redistribution of loads among the processors. Of course, the communication overhead associated with load transfers de-pends on the communication mechanisms supported by the underlying parallel computer and must be minimized. For convenience, in this paper we assume that the basic work-load unit is a task and all tasks are independent, that is, they can be assigned independ-ently to any processor and obtain the same result.

Tree and mesh architectures are considered to be two of the most promising candi-dates for highly scalable parallel multicomputer systems [9]. A tree architecture consists of N = 2l leaf processors with an l level balanced binary tree as its communication mechanism. An 8-leaf tree parallel computer example is shown in Fig. 10(a). As for the load balancing problem on an N-node tree architecture, the worst case occurs when the left half of N/2 processors each with exactly M tasks, transfers a total of (M/2)(N/2) tasks to the right half of empty N/2 processors. The resulting communication overhead is O(MNlog2N) if a pipelined packet routing scheme is used.

An N-node mesh architecture consists of a N× N two-dimensional array. Each node has a degree of 2, 3, or 4, depending on wether a node is a boundary node or an interior node. An example of 16-node 4× 4 mesh is depicted in Fig. 10(b). For the load balancing problem on an N-node mesh architecture, the worst case occurs when the left half of N/2 processors, each with exactly M tasks, transfers a total of (M/2)(N/2) tasks to

(14)

(a) An eight-leaf tree parallel computer. (b) A 16-node mesh parallel computer. Fig. 10. Examples of tree and mesh parallel computers.

the right half N/2 empty processors. The resulting communication overhead is O(M N) if a pipelined packet routing scheme is applied.

In this section, a load balancing algorithm based on the superconcentration algo-rithm described in the previous section is explored.

6.1 Load Balancing Algorithm

Assume that the ith processor Pihas litasks. The algorithm is composed of three steps: computation of the load differences, distribution, and redistribution. The load bal-ancing algorithm begins with step 1 by calculating the sum, average, and difference in the loads between the average and liof each processor [1-3, 27].

Step 2 distributes the overloaded tasks from the overloaded processors to the under-loaded processors. In this step, the processors are identified by the load difference be-tween the actual load and the average. If the load difference DIF(Pi) = li − AVG, where_AVG₌

 

_NL _{, of processor P}

iis positive, then the processor is overloaded, denoted byPiOL. If the load difference is negative, the processor is underloaded, denoted by .

UL i

P In this step, each overloaded processor sends at mostDIF(PiOL) tasks to some under-loaded processors, and each underunder-loaded processor receives at least |DIF(PiUL)| tasks from some overloaded processors.

Let BW(t) denote the minimum among the number of overloaded processors and underloaded processors at time t. BW(t) is the maximum number of tasks that can be su-perconcentrated from the overloaded to the underloaded processors through supercon-centration at time t. Tasks are superconcentrated from the top BW(t) overloaded proces-sors to the top BW(t) underloaded procesproces-sors. A processor Piis said to be filled or bal-anced when its load difference DIF(Pi) is zero. Unless specified otherwise, a processor with a zero load difference will remain neutral, and will not participate in the supercon-centration process. The superconsupercon-centration of tasks from overloaded processors to un-derloaded processors is repeated until BW(t) = 0.

It is possible that after all underloaded processors are filled, there are still some overloaded tasks in some overloaded processors since the_AVG₌

 

_NL _{. Therefore, one or}

more processors may still be overloaded by more than one task. In order to achieve load balancing with the number of tasks between any two processors differing by at most one, the third step, redistribution, is required. In the redistribution step, overloaded processors

(15)

and underloaded processors are reassigned new specifications. First, call processor Pi, which has DIF(Pi) = 0, an underloaded processor, ,

UL i

P which will accept at most one task from some overloaded processor. A processor Piwhich has DIF(Pi) = 1 is considered neutral and will remain idle in the redistribution step. Processor Pithat has DIF(Pi) > 1 is considered overloaded and will send DIF(Pi) − 1 tasks to the underloaded processors. This step is repeated through the superconcentration process until the load is balanced.

The complete load balancing algorithm based on superconcentration is described as follows.

Algorithm: Load Balancing Begin

Step 1: Compute load differences. (Let_AVG₌

 

_NL _{and DIF(P}

i) = li− AVG, respec-tively.) do in parallel

1.1: Compute L by summing all tasks on the processors in the hypercube in-terconnection network to the root processor, P0.

1.2: Compute AVG at the root processor, P0, and broadcast AVG to every

processor in the hypercube interconnection network.

1.3: Compute DIF(Pi) at each processor in the hypercube interconnection net-work.

Step 2: (Distribution) do in parallel

2.1: Identify processors with positive, negative, or zero DIF(Pi) as overloaded processors, underloaded processors, or balanced processors, respectively. 2.2: Superconcentrate one task from each of the top BW(t) overloaded

proces-sors to the top BW(t) underloaded procesproces-sors at time t through the super-concentration.

2.3: Update the load differences. The load differences of the top BW(t) under-loaded processor and the top BW(t) overunder-loaded processor processor are increased and decreased by one, respectively, after the superconcentration. 2.4: Go to Step 2.2 until BW(t) = 0. (that is, DIFs of all underloaded processors

are equal to zero.)

Step 3: (Redistribution) If DIF(PiOL)>1, 0≤i≤N−1 then do in parallel

3.1: Reassign processors with DIF greater than one, equal to zero, or equal to one, overloaded processors, underloaded processors, or balanced (idle) processors, respectively.

3.2: Repeat Step 2.2 and Step 2.3 until all load differences of the overloaded processors are equal to one’s.

END {of Load Balancing} 6.2 Correctness and Performance

Some lemmas are required for proving the correctness and estimating the time com-plexity of the load balancing algorithm.

Lemma 1The absolute value of the maximum load difference |DIF(Pi)| of any processor is M− 1.

(16)

Lemma 2 The superconcentration of overloaded tasks in the distribution step can be completed in O(Mn) time, in which the underloaded processors are filled.

Proof:LetDIFmax(PiOL)and ( ' )

UL i max P

DIF denote the maximum load difference of over-loaded processor Piand underloaded processor Pi’, respectively. By specifying BW(t) = ( the minimum number of overloaded processors or underloaded processors at time t ), we see that for each superconcentration process the load differenceDIFmax(PiOL)and

) ( '

UL i max P

DIF will decrease and/or increase by one, respectively. In addition, the maxi-mum load difference DIF of any processor is M− 1 by Lemma 1, and the routing time through the superconcentration process requires O(n) time. Therefore, the total time re-quired to superconcentrate overloaded tasks from overloaded processors to underloaded processors so that BW(t) reaches zero is at most 2(M− 1)n = O(Mn) time units. Lemma 3 The sum of the load differences of overloaded processors is at most N− 1 after the distribution step.

Proof: Let E be the sum of the load differences of overloaded processors after the distri-bution step. The maximum value of the load difference of overloaded processors after the distribution step is limited by

  



 

_NE

N E AVG N N L _AVG AVG= = × + = + . Therefore, 0≤ E ≤

N−1 and the proof is completed.

Lemma 4 The sum of the overload tasks of overloaded processors will be less than the number of underloaded processors after the completion of step 3.1.

Proof: From the step 3.1, the processors withDIFgreater than one are assigned as over-loaded processors, the processors withDIFequal to zero as underloaded processors, and the processors withDIFequal to one as balanced processors. Furthermore, by Lemma 3, the sum of the overload tasks of overloaded processors is at mostN−1 after the distribu-tion step (that is, step 2); the sum of the overload tasks of overloaded processors will be less than the number of underloaded processors after the completion of step 3.1. Other-wise, the sum ofDIF(Pi)s of all processors would be greater thanN−1, which

contra-dicts Lemma 3.

Lemma 5 The redistribution step needs at most (M−1)ntime units.

Proof: There are at mostN−1 overloaded tasks in one or more overloaded processors with no more than M tasks assigned to each processor after the distribution step by Lemma 3. In the redistribution step,BW(t) would be the number of overloaded proces-sors by Lemma 4. Since 0≤|DIF(Pi)|≤ M − 1 andntime units are required for routing

overloaded tasks through the superconcentration, the time required for performing the redistribution step is therefore at most (M−1)n. Theorem 2 The load balancing based on superconcentration can be achieved inO(Mn) communication and computation time, where each processor has a maximum ofMtasks. Proof: First, we prove the correctness of the algorithm. By Lemma 3 there are at mostN

−1 tasks in the overloaded processors after the distribution step. Hence the distribution step alone will not be sufficient to achieve load balancing. The redistribution step is

(17)

re-quired to achieve load balancing. By Lemma 4, there are more underloaded processors than the sum of overload tasks from overloaded processors at the beginning of step 3.2. Therefore, it is sufficient to conclude that the load balancing will be accomplished after the redistribution step with the number of tasks between any two processors differing by at most one.

Next, we estimate the time complexity of the algorithm. It takesO(n) time units to compute sumLand the load differences, and if takes a constant time for the averageAVG for each processor. Therefore, step 1 requires onlyO(n) time units for both communica-tion and computacommunica-tion.

By Lemma 2, the superconcentration process between overloaded processors and underloaded processors will takeO(Mn) communication and computation time to com-plete. By Lemma 5, the redistribution step takes at most an additional (M − 1)n time units.

Combining the above results, the theorem is proved.

7. CONCLUSIONS

In this paper, we proposed a few basic algorithms and then applied them to the con-centration, superconcon-centration, load balancing, multicasting and PPR problems on hy-percube interconnection networks. The result shows that both concentration and super-concentration problems can be solved inO(n) time and the multicasting and PPR prob-lems inO(n2) and requires only two buffers for each node. Finally, the superconcentra-tion along with the basic algorithms are applied to the load balancing problem. The result shows that this problem can be solved inO(Mn) time, where nis the dimension of hy-percube interconnection networks andMis the maximum number of tasks in each node.

REFERENCES

1. S. H. Bokhari, “A network flow model for load balancing in circuit-switched multi-computers,”IEEE Transactions on Parallel and Distributed Systems, Vol. 4, 1993, pp. 649-657.

2. B. S. Chlebus, J. D. P. Rolim, and G. Sluutzki, “Distributing tokens on a hypercube without error accumulation,” in Proceedings of the 1996 International Parallel Processing Symposium, 1996, pp. 573-578.

3. G. Cybenko, “Dynamic load balancing for distributed memory multiprocessors,” Journal of Parallel Distributed Computing, Vol. 7, 1989, pp. 279-301.

4. F. R. K. Chung, “On concentrators, superconcentrators, generalizers, and nonblocking networks,”Bell Systems Technical Journal, Vol. 58, 1978, pp. 1765-1777.

5. T. H. Cormen, “Concentrator switches for routing messages in parallel computers,” M.S. thesis, Department of Electrical Engineering and Computer Science, MIT, 1986. 6. O. Gabber and Z. Galil, “Explicit constructions of linear sized superconcentrators,”

Journal of Computer and System Sciences, Vol. 22, 1981, pp. 407-420.

7. D. Gerogiannis and S. C. Orphanoudakis, “Load balancing requirements in parallel implementations of image feature extraction tasks,” IEEE Transactions on Parallel and Distributed Systems, Vol. 4, 1993, pp. 994-1013.

(18)

8. E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms C++, New York: Compress Science Press, 1997.

9. K. Hwang,Advanced Computer Architecture: Parallelism, Scalability, Programmabil-ity, New York: McGraw-Hill, 1993.

10. G. E. (C. Y.) Jan and A. Y. Oruc, “Fast self-routing permutation switching on an as-ymptotically minimum cost network,”IEEE Transactions on Computers, Vol. C-42, 1993, pp 1469-1479.

11. D. Koppelman and A. Y. Oruc, “A self-routing permutation network,”Journal of Par-allel and Distributed Computing, Vol. 7, 1990, pp. 140-151.

12. O. Kremien and J. Kramer, “Methodical analysis of adaptive load sharing algorithms,” IEEE Transactions on Parallel and Distributed Systems, Vol. 3, 1992, pp. 747-760. 13. C. Y. Lee and A. Y. Oruc, “Design of efficient and easily routable generalized

connec-tors,”IEEE Transactions on Communications, Vol. 43, 1995, pp. 646-650.

14. F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, San Mateo, California: Morgan Kaufmann, 1992.

15. F. C. H. Lin and R. M. Keller, “The gradient model load balancing method,” IEEE Transactions on Software Engineering, Vol. SE-13, 1987, pp. 32-38.

16. D. H. Linder and J. C. Harden, “An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes,”IEEE Transactions on Computers, Vol. C-40, 1991, pp. 2-12.

17. G. A. Margulis, “Explicit constructions of concentrators,” Problems of Information Transmission, Vol. 9, 1973, pp. 325-332.

18. R. Mirchandaney, D. Towsley, and J. A. Stankovic, “Analysis of the effetcs of delays on load sharing,”IEEE Transactions on Computers, Vol. C-38, 1989, pp. 1513-1525. 19. D. Nicol and J. Saltz, “Dynamic remapping of parallel computations with varying

resource demands,” IEEE Transactions on Computers, Vol. C-37, 1988, pp. 1073-1087.

20. M. C. Pease, “The indirect binaryn-cube microprocessor array,”IEEE Transactions on Computers, Vol. C-26, 1976, pp. 458-473.

21. M. S. Pinsker, “On the complexity of a concentrator,” inProceedings of the 7th Inter-national Teletraffic Congress, 1973, pp. 318/1-318/4.

22. C. L. Seitz, “The cosmic cube,”Communications of ACM, Vol. 28, 1985, pp. 22. 23. E. Shamir and E. Upfal, “A probabilistic approach to the load-sharing problem in

distributed systems,” Journal of Parallel Distributed Computing, Vol. 4, 1987, pp. 521-530.

24. K. G. Shin and Y. C. Chang, “Load sharing in distributed real time systems with state-change broadcasts,” IEEE Transactions on Computers, Vol. C-38, 1989, pp. 1124-1142.

25. H. J. Siegel and S. D. Smith, “Study of multistage SIMD interconnection networks,” 5th Annual Symposium on Computer Architecture, 1978, pp. 223-229.

26. M. H. Willebeek-LeMair and A. P. Reeves, “Strategies for dynamic load balancing on highly parallel computers,”IEEE Transactions on Parallel and Distributed Systems, Vol. 4, 1993, pp. 979-993.

27. J. Woo and S. Sahni, “Load Balancing on a hypercube,” in Proceedings of the 5th International Parallel Processing Symposium, 1991, pp. 525-530.

(19)

28. S. Zhou, “A trace-driven simulation study of dynamic load balancing,”IEEE Transac-tions on Software Engineering, Vol. SE-14, 1988, pp. 1327-1341.

Gene Eu Jan () received a B.S. degree in electrical engineering from the National Taiwan University in 1983, and an M.S. and a Ph.D. in electrical engineering from the University of Maryland, College Park, in 1988 and 1992, respectively.

He has been an Associate Professor with the Departments of Computer Science, and Navigation, National Taiwan Ocean Uni-versity, Keelung, Taiwan since 1993.

Prior to joining the National Taiwan Ocean University, he was a Visiting Assistant Professor in the Department of Electrical and Computer Engineering at the California State University, Fresno, California. His research interests include parallel computer systems, interconnec-tion networks, and VLSI systems design.

Frank Yeong-Sung Lin () received his BS degree in electrical engineering from the Electrical Engineering Depart-ment, National Taiwan University in 1983, and his Ph.D. degree in electrical engineering from the Electrical Engineering Depart-ment, University of Southern California in 1991. After graduating from the USC, he joined Telcordia Technologies (formerly Bell Communications Research, abbreviated as Bellcore) in New Jer-sey, U.S.A., where he was responsible for developing network planning and capacity management algorithms. In 1994, Prof. Lin joined the faculty of the Electronic Engineering Department, National Taiwan University of Science and Technology. Since 1996, he has been with the faculty of the Information Management Department, National Taiwan University. His research interests include network optimization, network planning, performance evaluation, high-speed networks, wireless communications systems and distributed algorithms.

Ming-Bo Lin ()received the B.S. degree in electronic engineering from the National Taiwan Institute of Technology, Taipei, the M.S. degree in electrical engineering from the Na-tional Taiwan University, Taipei, and the Ph.D. degree in electri-cal engineering from the University of Maryland, College Park. Since February 2001, he has been a professor of the department of electronic engineering at the National Taiwan Institute of Technology, Taipei. His research interests include VLSI systems design, parallel algorithms, computer arithmetic, and fault-tolerant computing.

(20)

Deron Liang ( ) received a BS degree in electrical engineering from National Taiwan University in 1983, and an MS and a Ph.D. in computer science from the University of Maryland at College Park in 1991 and 1992 respectively. He is on the fac-ulty of Computer Science Department, National Taiwan Ocean University, Taiwan since 2001. He also holds joint appointment with the Institute of Information Science (IIS), Academia Sinica, Taipei, Taiwan, Republic of China. He was with IIS from 1993 till 2001.

Dr. Liang’s current research interests are in the areas of software fault-tolerance, system security, and system reliability analysis. Dr. Liang is a member of ACM and IEEE.