
Optimal Processor Mapping for Linear-Complement Communication on Hypercubes

Yomin Hou, Chien-Min Wang, Member, IEEE Computer Society, Chiu-Yu Ku, and Lih-Hsing Hsu

Abstract: In this paper, we address the problem of minimizing channel contention of linear-complement communication on wormhole-routed hypercubes. Our research reveals that, for traditional routing algorithms, the degree of channel contention of a linear-complement communication can be quite large. To solve this problem, we propose an alternative approach, which applies processor reordering mapping at compile time. In this compiler approach, processors are logically reordered according to the given communication(s) so that the new communication(s) can be efficiently realized on the hypercube network. It is proved that, for any linear-complement communication, there exists a reordering mapping such that the new communication has minimum channel contention. An $O(n^3)$ algorithm is proposed to find such a mapping for an $n$-dimensional hypercube. An algorithm based on dynamic programming is also proposed to find an optimal reordering mapping for a set of linear-complement communications. Several computer simulations have been conducted and the results clearly show the advantage of the proposed approach.

Index Terms: Hypercubes, linear-complement communication, channel contention, processor mapping, wormhole routing.


1 INTRODUCTION

In a message-passing multicomputer, efficient schemes to move messages among the processors are required for obtaining fast and efficient parallel algorithms. Many studies have been based on store-and-forward routing, where the message latency is proportional to the product of the message length and the number of routing steps. Hence, most of them have concentrated on minimizing the number of routing steps in moving messages among processors [4], [16], [17], [18], [20], [21]. On the other hand, wormhole routing has been widely adopted recently due to its effectiveness in interprocessor communication [1], [2], [22]. With wormhole routing, each message is divided into a number of flits. The header flit(s) carries the address information and governs the route, while the remaining flits of the message follow in a pipelined fashion. The pipelined nature of wormhole routing provides two attractions. First, in the absence of channel contention, the network latency is relatively insensitive to the path length [9], [22]. Second, large message buffers at each router are obviated; only small flit buffers are required [22].

However, channel contention can have a severe impact on the network latency of wormhole routing. Channel contention happens when multiple messages simultaneously use the same channel in their routes. If k messages are contending for the same channel, only one of them can reserve the channel and be forwarded through it. The other messages have to wait until the channel is released and then contend for it again. In the worst case, the network latency becomes k times as long. Hence, an important issue on wormhole-routed parallel computers is to minimize channel contention between messages. Many studies have tried to improve the adaptability of routing algorithms to solve the contention problem [5], [7], [10], [11], [15], [22]. However, this approach requires extra hardware support, such as buffer space and control logic. Moreover, the complexity of adaptive routers significantly increases their interrouter setup delay and flow control cycle times [6]. Consequently, the claimed performance advantages in channel utilization may not balance the losses in achievable implementation speed. For these reasons, we try to solve the contention problem by compiler approaches rather than by routing algorithms at runtime.

In this paper, we focus on the problem of minimizing the channel contention of linear-complement communication (LCC) on wormhole-routed hypercubes [14], [23]. Linear-complement communication is a class of communications where the address bits of the destination of each message are linear combinations of the address bits of its source and their complements. Many important problems, like fast Fourier transform, matrix transposition, polynomial evaluation, etc., can be effectively solved on parallel computers that have an efficient scheme to support this type of communication. To minimize the channel contention of LCC, we adopt a new approach which applies processor reordering mapping at compile time.

- Y. Hou and L.-H. Hsu are with the Department of Computer and Information Science, National Chiao-Tung University, Hsinchu, Taiwan, ROC. E-mail: {ymhou, lhhsu}@cis.nctu.edu.tw.

- C.-M. Wang is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan, ROC. E-mail: cmwang@iis.sinica.edu.tw.

- C.-Y. Ku is with Avanti Technology Inc., Taipei, Taiwan, ROC. E-mail: cyku@avanticorp.com.

Manuscript received 15 July 1998; revised 27 Sept. 1999; accepted 1 Sept. 2000.

For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 107149.


In this compiler approach, processors are logically rearranged according to the given communications before the executable code of the given parallel program is generated. In this way, no extra data movement is incurred. By appropriately rearranging the processors, the new communications can be efficiently realized on the hypercube network. It can be proved that, for any LCC, there exists a reordering mapping such that the new communication has minimum channel contention. An $O(n^3)$ algorithm is proposed to find such a mapping for an $n$-dimensional hypercube. For a parallel program containing more than one LCC, a dynamic programming algorithm is proposed to find an optimal reordering mapping. With these results, compiler techniques can be used to minimize the channel contention of LCCs on hypercubes. In addition, only the e-cube routing is assumed in the proposed approach and no extra hardware is needed. Experiments based on computer simulation have been conducted and the results clearly show the advantage of the proposed approach.

The discussion in this paper can be directly applied to hypercube computers such as the nCUBE 2 [1]. However, for variations of the hypercube topology, the discussion may require some modifications. For example, in the SGI Origin 2000 [2], two nodes are connected to a single router instead of one. Therefore, the network topology for an Origin system with $2^n$ nodes is an $(n-1)$-dimensional hypercube. Hence, an $n$-dimensional LCC on an Origin system can be viewed as performing two LCCs simultaneously on the $(n-1)$-dimensional hypercube. A detailed discussion about performing LCCs on an Origin system can be found in [12].

Related research on store-and-forward interconnection networks has been reported. Most of it focused on subsets of LCC, such as linear-complement permutation (LCP) and bit-permute-complement permutation (BPC). For example, Boppana and Raghavendra [4] considered LCPs on hypercubes, Nassimi and Sahni [20], [21] dealt with BPCs on meshes and hypercubes, and Masuyama [17], [18] dealt with BPCs on chordal rings and hypercubes. However, none of these methods can be applied to linear-complement scatter (LCS) or linear-complement gather (LCG). Lin and Wang [16] considered another set of communications represented by the s-d mask formalism. Although this representation scheme can encode a broad class of communications, it still cannot represent an LCS or LCG with a single s-d mask. All of the above works tried to give efficient routing algorithms. Hence, they require newly designed hardware and cannot be applied to existing parallel computers. Moreover, they aim at minimizing the number of routing steps and are not suitable for wormhole-routed networks.

The rest of this paper is organized as follows: Section 2 introduces the notations and definitions. In Section 3, we describe the compiler approach. Some properties and an algorithm of processor reordering mapping are presented in Section 4. In Section 5, a dynamic programming algorithm is proposed for a set of LCCs. Section 6 shows the experimental results based on computer simulation. Finally, conclusions are given in Section 7.

2 BACKGROUND

When a communication is performed, messages are generated by a set of source nodes and transmitted through the interconnection network to their destination nodes. The communication latency is the interval from the time the source nodes begin to send messages until the last destination node has received its message. If some of the paths for transmitting these messages contend for the same channel, the communication latency is increased. The more paths contend for the same channel, the longer the communication latency. Therefore, the maximum number of paths contending for the same channel has a severe impact on the communication latency. Let the degree of channel contention of a communication be defined as the maximum number of paths contending for the same channel. In this paper, we consider the problem of performing LCC on a hypercube computer with e-cube wormhole routing. Our goal is to minimize the degree of channel contention so that LCC can be performed efficiently. In this section, the definitions and notations of these terminologies are clarified.

2.1 The Hypercube Network

An $n$-dimensional hypercube is a directed graph which contains $N = 2^n$ nodes and $n \cdot 2^n$ channels. Each node corresponds to an $n$-bit binary string, $b_{n-1}b_{n-2}\ldots b_1b_0$. We shall use a binary vector $[b_0 b_1 \ldots b_{n-1}]^t$ to represent it. Two nodes are connected by a pair of channels, one for each direction, if and only if their binary strings differ in exactly one bit. As a consequence, each node is incident to $n$ other nodes through $n$ different channels, one for each bit position. The channel from node $x$ to $x'$ is denoted by $(x, x')$ and is said to be at dimension $k$ if $x$ and $x'$ differ in the $k$th bit position.

2.2 Linear-Complement Communication

The messages generated when performing a communication operation can usually be formulated by some specific pattern. For example, in the bit-reverse communication operation, the source and destination nodes of each message can be represented by $[x_0 x_1 \ldots x_{n-1}]^t$ and $[x_{n-1} x_{n-2} \ldots x_0]^t$, respectively. In what follows, we will define the class of linear-complement communications on an $n$-dimensional hypercube. The addition and multiplication in this section are modulo-2, i.e., they are defined in the finite field GF(2) [19].

Definition 1. A communication is a linear-complement communication (LCC) if there exists a binary matrix $A_{n\times n}$ and an $n$-dimensional binary vector $b$ such that, for every message with source node $x$, its destination node $y$ is given by the equation $y = Ax + b$.

Definition 2. An LCC with a binary matrix $A_{n\times n}$ and an $n$-dimensional binary vector $b$ is a linear-complement permutation (LCP) if $A_{n\times n}$ is nonsingular, i.e., $\mathrm{rank}(A) = n$.

Definition 3. An LCC with a binary matrix $A_{n\times n}$ and an $n$-dimensional binary vector $b$ is a linear-complement gather (LCG) if $\mathrm{rank}(A) < n$.


Scatter, the dual operation of gather, can be implemented by simply reversing the direction of message transmission of gather. Thus, we can define linear-complement scatter as follows:

Definition 4. A communication is a linear-complement scatter (LCS) if there exists a binary matrix $A_{n\times n}$ with $\mathrm{rank}(A_{n\times n}) < n$ and an $n$-dimensional binary vector $b$ such that, for every message with destination node $y$, its source node $x$ is given by the equation $Ay + b = x$.

Example 1. The bit-reverse communication operation on an 8-dimensional hypercube is a linear-complement communication:
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
=
\begin{bmatrix}
0&0&0&0&0&0&0&1\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&1&0&0&0\\
0&0&0&1&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
1&0&0&0&0&0&0&0
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
$$

For any source-destination pair $(x, y)$ in an LCC defined above, obtaining the address of $y$ is equivalent to performing a linear transformation of the $n$-dimensional vector space on $x$ and then adding a constant binary vector. Since there is a one-to-one correspondence between the binary matrices of size $n \times n$ and the linear transformations on an $n$-dimensional vector space over GF(2), we shall utilize this property in the following sections.
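As a concrete illustration, the following minimal Python sketch (using numpy; the helper name lcc_destination is ours, not the paper's) evaluates the destination of a message under the bit-reverse LCC of Example 1:

import numpy as np

def lcc_destination(A, b, x):
    """Destination y = Ax + b over GF(2) for the message with source x."""
    return (A @ x + b) % 2

n = 8
A = np.fliplr(np.eye(n, dtype=int))       # bit-reverse matrix of Example 1
b = np.zeros(n, dtype=int)
x = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # source bits [x0 x1 ... x7]^t
print(lcc_destination(A, b, x))           # -> [0 1 0 0 1 1 0 1], the bits reversed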

2.3 Routing Strategies

The interconnection network must allow every node to send messages to every other node. In the absence of a complete network, we need a routing algorithm to determine the path selected by a message to reach its destination. Efficient routing algorithms are critical to the performance of the interconnection network.

The e-cube routing algorithm is the simplest deadlock-free routing strategy on wormhole-routed hypercubes; it reserves the required channels in a strictly increasing order of dimensions. It allows messages to be forwarded only over channels at higher dimensions than that of the last traversed channel. Many current hypercube computers use the e-cube routing because of its simplicity and ease of implementation. However, the e-cube routing establishes only one shortest path between each pair of source and destination nodes and does not take advantage of the flexibility provided by hypercubes.
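For reference, a small sketch of the e-cube path computation under this rule (integer node labels, with bit k as the dimension-k coordinate; the function name is ours):

def ecube_path(src, dst, n):
    """Channels traversed by e-cube routing from src to dst on an n-cube.
    E-cube corrects differing bits in strictly increasing order of dimension."""
    path = []
    cur = src
    for k in range(n):
        if (cur ^ dst) & (1 << k):
            nxt = cur ^ (1 << k)
            path.append((cur, nxt, k))   # channel (cur -> nxt) at dimension k
            cur = nxt
    return path

# Example: 4-cube, 0b0101 -> 0b0011 corrects dimensions 1 and 2 in order.
print(ecube_path(0b0101, 0b0011, 4))     # -> [(5, 7, 1), (7, 3, 2)]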

Fully adaptive routing strategies can route messages along any of the shortest paths available in the hypercube network. Unfortunately, multiple channels are needed between each pair of neighboring nodes in order to help these strategies prevent deadlock. This means that extra hardware support, such as buffer space and control logic, has to be added to the routers. Some partially adaptive routing strategies without the need for multiple channels have been proposed [5], [7], [11], [15]. Although they can better utilize the flexibility provided by hypercubes, the complexity of adaptive routers significantly increases the interrouter setup delay and flow control cycle times [6].

Furthermore, even when a partially adaptive routing strategy is used, our simulation results reveal that the average network latency grows much more rapidly than expected as the network throughput increases. This limits the utilization of the communication capacity of an interconnection network. For example, when the adaptive routing strategy in [7] or [15] is used, the throughput of performing bit-reverse on an 8-dimensional hypercube is always less than 1/8 of the communication capacity of the hypercube network. The main reason for the poor performance is that there exists a set of 8 source-destination pairs contending for the channel $([00010000]^t, [00000000]^t)$. Similar conditions also arise when performing matrix-transpose or reverse-flip. The adaptive routing strategy in [11] suffers from similar problems when performing matrix-transpose or bit-reverse.

From the above discussion, it is clear that adaptive routing algorithms cannot appropriately solve the contention problem for performing LCCs on hypercubes. Hence, in this paper, we propose a compiler approach called processor reordering mapping to minimize the channel contention. The hypercube network is assumed to support only e-cube routing. Thus, no extra hardware is needed and the proposed approach is of practical use.

3 THE COMPILER APPROACH

In the proposed compiler approach, the communications to be performed in a parallel program can be detected by compilers automatically or specified by programmers. These communications will be transformed into the matrix form for the optimization process. Next, according to the given communications, an optimal processor mapping is determined. The compiler can then generate the SPMD (Single Program Multiple Data) node program for each processor accordingly. By appropriately choosing a processor mapping, the new communications can be efficiently realized on the hypercube network.

As an example, consider executing the parallel loop shown in Fig. 1a on hypercubes. The function $\mathrm{reverse}(i, n)$ returns the bit-reverse of $i$, i.e.,
$$ i_0 \cdot 2^{n-1} + i_1 \cdot 2^{n-2} + \ldots + i_{n-2} \cdot 2 + i_{n-1}, $$
for the loop index
$$ i = i_{n-1} \cdot 2^{n-1} + i_{n-2} \cdot 2^{n-2} + \ldots + i_1 \cdot 2 + i_0. $$
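A minimal Python sketch of this function (Fig. 1a itself is not reproduced here, so the body below is our own rendering):

def reverse(i, n):
    """Bit-reverse of an n-bit loop index i: bit i_k moves to weight 2^(n-1-k)."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)    # peel the lowest bit of i onto the bottom of r
        i >>= 1
    return r

assert reverse(0b0001, 4) == 0b1000 and reverse(0b0110, 4) == 0b0110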

If array elements $b[i]$ and $r[i]$ are distributed on processor $P_i$, and the task of executing iteration $i$ is also assigned to processor $P_i$, then the compiler can determine that the communication to be performed is bit-reverse and transform it into the matrix form shown in Example 1. The SPMD node program without processor mapping can be generated as shown in Fig. 1b for comparison.

With processor mapping, the compiler has to determine an optimal one-to-one mapping function $f$ according to the given communication. The function $f$ maps virtual processor $x$ onto physical processor $x' = f(x)$.

From the viewpoint of virtual processors, data distribution and iteration assignment remain unchanged, i.e., array elements $b[i]$ and $r[i]$ are distributed on virtual processor $VP_i$ and the task of executing iteration $i$ is also assigned to virtual processor $VP_i$. The virtual processor $VP_i$ is now mapped onto the physical processor $P_{f(i)}$. Accordingly, the SPMD node program for each processor can be generated, as shown in Fig. 1c. Though the communication to be performed between virtual processors is still bit-reverse, the one between physical processors is changed. It will be proved in the next section that the degree of channel contention can be reduced by appropriately choosing the mapping function $f$. An algorithm for finding an optimal mapping function will also be proposed.

Since the sending and receiving of data can be accomplished by hardware, as shown in Fig. 1b, the only runtime overhead of the program shown in Fig. 1c is the mapping between virtual processors and physical processors. This overhead can be reduced to array indexing operations, as shown in Fig. 1d. Clearly, this overhead is far less than the communication latency and can therefore be neglected. Note also that the SPMD node program shown in Fig. 1d is very similar to the one shown in Fig. 1b, except that the functions get_pid(), send(), and receive() in Fig. 1b are replaced by the functions v_get_pid(), v_send(), and v_receive(), respectively. This property makes code generation with processor mapping much easier. In addition to generating the node program, the compiler has to determine an appropriate mapping function and set up the two mapping arrays to_virtual[] and to_physical[] for use in the node program.
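A hypothetical sketch of this indirection follows: the arrays to_virtual[] and to_physical[] are precomputed from the mapping function f, and the v_* wrappers add one array lookup per call. Here get_pid() and send() are stand-ins for the machine's primitives; none of this is the paper's actual node-program code.

def get_pid():
    return 0                                  # stub: this node's physical id

def send(phys_dest, msg):
    print(f"send to physical node {phys_dest}: {msg!r}")   # stub primitive

def build_mapping_arrays(f, num_procs):
    """Precompute both directions of the mapping function f at compile time."""
    to_physical = [f(v) for v in range(num_procs)]
    to_virtual = [0] * num_procs
    for v, p in enumerate(to_physical):
        to_virtual[p] = v
    return to_virtual, to_physical

to_virtual, to_physical = build_mapping_arrays(lambda v: v ^ 1, 256)

def v_get_pid():
    return to_virtual[get_pid()]              # virtual id of this processor

def v_send(v_dest, msg):
    send(to_physical[v_dest], msg)            # route by physical id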

Another issue arises when more than one communication is to be performed in a parallel program. Our approach is to search for an optimal processor mapping for all the communications. We shall propose a dynamic programming algorithm for that purpose in Section 5. Another approach is to perform processor remapping at runtime. However, it may result in significant overhead at runtime. Thus, it is not considered in this paper.

4 PROCESSOR REORDERING MAPPING

First, we shall show that the degree of channel contention for an LCC is directly related to the ranks of submatrices of the binary matrix $A$ of the LCC. A submatrix of the binary matrix $A$ is the matrix obtained from $A$ by retaining the entries in some row(s) and column(s) and deleting the other entries. We shall use $A_{(i)}$ to denote the $i$th row of $A$ and $A^{(i)}$ to denote the $i$th column of $A$. The following defines the special submatrices and their notations to be used in this paper:

Definition 5. A submatrix of the binary matrix $A$ obtained by retaining rows in the set $R$ and columns in the set $C$ is denoted as $A_{R,C}$.

We shall use $L_i$ to denote the set of nonnegative integers smaller than $i$. Therefore, given integers $i$ and $j$, $A_{L_i,L_j}$ is the upper-left submatrix of $A$ with $i$ rows and $j$ columns. Example 2 shows the submatrix $A_{L_3,L_2}$ of a $4 \times 4$ matrix $A$.

Example 2. Given
$$
A = \begin{bmatrix}
a_{0,0} & a_{0,1} & a_{0,2} & a_{0,3} \\
a_{1,0} & a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,0} & a_{2,1} & a_{2,2} & a_{2,3} \\
a_{3,0} & a_{3,1} & a_{3,2} & a_{3,3}
\end{bmatrix},
\qquad
A_{L_3,L_2} = \begin{bmatrix}
a_{0,0} & a_{0,1} \\
a_{1,0} & a_{1,1} \\
a_{2,0} & a_{2,1}
\end{bmatrix}.
$$

Theorem 1. Given an LCC $y = Ax + b$, the maximum number of paths that contend for the same channel at dimension $i$, denoted $T_i(A, b)$, can be determined as follows:
$$
T_i(A, b) = \begin{cases}
0, & \text{when } x_i = y_i \text{ for all } x; \\
2^{\,i - \mathrm{rank}(A_{L_{i+1},L_i})}, & \text{otherwise.}
\end{cases}
$$



Proof. For any channel
$$ l = ([z_0 z_1 \ldots \bar{z}_i \ldots z_{n-1}]^t,\ [z_0 z_1 \ldots z_i \ldots z_{n-1}]^t) $$
at dimension $i$, suppose that $l$ is on the path from node $x = [x_0 x_1 \ldots x_{n-1}]^t$ to node $y = [y_0 y_1 \ldots y_{n-1}]^t$ according to the e-cube routing. Then we have
$$ [x_i x_{i+1} \ldots x_{n-1}]^t = [\bar{z}_i z_{i+1} \ldots z_{n-1}]^t $$
and
$$ [y_0 y_1 \ldots y_i]^t = [z_0 z_1 \ldots z_{i-1} z_i]^t. $$
Since $y = Ax + b$, we have
$$
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_i \end{bmatrix}
=
\begin{bmatrix} z_0 \\ z_1 \\ \vdots \\ z_i \end{bmatrix}
=
\begin{bmatrix}
a_{0,0} & a_{0,1} & \cdots & a_{0,i-1} \\
a_{1,0} & a_{1,1} & \cdots & a_{1,i-1} \\
\vdots & \vdots & & \vdots \\
a_{i,0} & a_{i,1} & \cdots & a_{i,i-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{i-1} \end{bmatrix}
+
\begin{bmatrix}
a_{0,i} & a_{0,i+1} & \cdots & a_{0,n-1} \\
a_{1,i} & a_{1,i+1} & \cdots & a_{1,n-1} \\
\vdots & \vdots & & \vdots \\
a_{i,i} & a_{i,i+1} & \cdots & a_{i,n-1}
\end{bmatrix}
\begin{bmatrix} \bar{z}_i \\ z_{i+1} \\ \vdots \\ z_{n-1} \end{bmatrix}
+
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_i \end{bmatrix}.
$$
Note that the number of paths contending for channel $l$ is equal to the number of solutions for $[x_0 x_1 \ldots x_{i-1}]^t$. According to linear algebra, either there is no solution or there are exactly $2^{\,i - \mathrm{rank}(A_{L_{i+1},L_i})}$ distinct solutions satisfying the above set of equations. Note also that no solution means no path passes through channel $l$. Hence, for all channels at dimension $i$, there is no solution if and only if $y_i = x_i$. In all other cases, there exists a channel at dimension $i$ such that exactly $2^{\,i - \mathrm{rank}(A_{L_{i+1},L_i})}$ paths contend for it. □

The above theorem shows that the degree of channel contention is determined only by the ranks of submatrices of the binary matrix A. As an example, we shall compute the degree of channel contention of matrix-transpose on an 8-dimensional hypercube according to Theorem 1.

Example 3. Consider the matrix-transpose communication on an 8-dimensional hypercube:
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
=
\begin{bmatrix}
0&0&0&0&1&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&1\\
1&0&0&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&1&0&0&0&0
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
$$

According to Theorem 1, the degree of channel contention at each dimension can be computed as follows:
$$ T_0 = 2^{0-0} = 1,\quad T_1 = 2^{1-0} = 2,\quad T_2 = 2^{2-0} = 4,\quad T_3 = 2^{3-0} = 8, $$
$$ T_4 = 2^{4-1} = 8,\quad T_5 = 2^{5-3} = 4,\quad T_6 = 2^{6-5} = 2,\quad T_7 = 2^{7-7} = 1. $$
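Theorem 1 is easy to check mechanically. Below is a small Python sketch (the names gf2_rank and contention are ours) that computes the per-dimension contention from the ranks of the leading submatrices, reproducing the numbers of Example 3; it assumes, as in the discussion later in this section, that every dimension actually carries traffic ($T_i > 0$):

import numpy as np

def gf2_rank(M):
    """Rank of a binary matrix over GF(2), by Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    rows, cols = M.shape
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue                       # no pivot in this column
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]            # eliminate column c elsewhere
        rank += 1
    return rank

def contention(A):
    """T_i(A, b) of Theorem 1 for each dimension i, assuming every
    dimension carries traffic (the case T_i = 0 is handled separately)."""
    n = A.shape[0]
    return [2 ** (i - gf2_rank(A[:i + 1, :i])) for i in range(n)]

# Matrix-transpose on the 8-cube (Example 3): swap the address halves.
Z, I = np.zeros((4, 4), dtype=int), np.eye(4, dtype=int)
A = np.block([[Z, I], [I, Z]])
print(contention(A))                       # -> [1, 2, 4, 8, 8, 4, 2, 1]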

Definition 6. The degree of channel contention of an LCC $y = Ax + b$ is defined to be $\max_{0 \le i \le n-1} \{T_i(A, b)\}$ and is denoted by $T(A, b)$.

To minimize $T(A, b)$, the binary matrix $A$ must be "changed" and, at the same time, the LCC must still be performed correctly. To meet these two requirements, we propose an alternative approach, called processor reordering mapping. In this approach, processors are logically reordered by a reordering mapping, which is a permutation of the address bits of processors. It is defined formally as follows:

Definition 7. A processor mapping is said to be a linear mapping if there exists a binary matrix $Q_{n\times n}$ such that, for every node $x$, it is mapped to node $x' = Qx$.

Definition 8. A processor mapping is a reordering mapping if it is a linear mapping and the matrix $Q$ is a permutation matrix, i.e., each row and each column of $Q$ contains exactly one 1.

A reordering mapping has the property that the neighboring relation of processors is kept unchanged after the processors are reordered. Since communication between neighboring processors is the most frequently used class of communications, this property provides a great advantage in practice. In order to ensure that the LCC is performed correctly, the communication after a processor reordering mapping must be changed as shown in the following theorem:

Theorem 2. Given an LCC with a binary matrix $A$ and a binary vector $b$, the new communication after the reordering mapping with a permutation matrix $Q$ is an LCC with a binary matrix $QAQ^{-1}$ and a binary vector $Qb$.

Proof. Fig. 2 illustrates this theorem. For a source node $x$ in the LCC with $A$ and $b$, the destination node $y$ can be computed by the equation $y = Ax + b$. After the reordering mapping by $Q$, we have $x' = Qx$ and $y' = Qy$. Since $Q$ is nonsingular, we can derive $x = Q^{-1}x'$. Hence, we have
$$ y' = Qy = Q(Ax + b) = QAQ^{-1}x' + Qb. \qquad \square $$
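A quick numerical check of this relation (a Python sketch reusing lcc_destination from the earlier listing; for a permutation matrix, $Q^{-1} = Q^t$):

rng = np.random.default_rng(0)
n = 8
A = np.fliplr(np.eye(n, dtype=int))            # bit-reverse LCC of Example 1
b = rng.integers(0, 2, n)
Q = np.eye(n, dtype=int)[rng.permutation(n)]   # a random reordering mapping
x = rng.integers(0, 2, n)

y = lcc_destination(A, b, x)
new_A, new_b = (Q @ A @ Q.T) % 2, (Q @ b) % 2  # Theorem 2's new LCC
assert np.array_equal(Q @ y % 2, lcc_destination(new_A, new_b, Q @ x % 2))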

In other words, the degree of channel contention after the reordering mapping is determined by $QAQ^{-1}$. By choosing an appropriate processor mapping $Q$, the degree of channel contention can be greatly reduced. In the following discussion, we show how to find an optimal reordering mapping for an LCC. First, notice the special situation in Theorem 1 where $T_i(A, b) = 0$ at dimension $i$. It happens only when $y_i = x_i$ for all $x$, i.e., $A_{(i)} = I_{(i)}$ and $b_i = 0$, where $I$ is the identity matrix. Since $T_i(A, b) = 0$ means there is no communication at dimension $i$, we wish to keep it unchanged when performing processor mapping. This can be accomplished by applying a reordering mapping $Q$ that moves the $i$th row of $A$ to the last dimension before searching for an optimal mapping. Hence, the $i$th row and column of $A$ are moved to the last dimension in $QAQ^{-1}$.

Since
$$ [QAQ^{-1}]_{L_{n-1},L_{n-1}} = A_{L_n-\{i\},\,L_n-\{i\}} $$
and
$$ T_{n-1}(QAQ^{-1}, Qb) = T_i(A, b) = 0, $$
we can simply consider the submatrix $A_{L_n-\{i\},\,L_n-\{i\}}$ when minimizing the channel contention of $A$. Without loss of generality, we may assume $T_i(A, b) > 0$ for $0 \le i \le n-1$ in the following discussion.

Fig. 2. The new communication after the reordering mapping $Q$ for $y = Ax + b$.

Theorem 3. For any nonsingular binary matrix $A_{n\times n}$, there exists a permutation matrix $Q_{n\times n}$ such that
$$ \mathrm{rank}([QAQ^{-1}]_{L_{i+1},L_i}) = i \quad \text{for } 0 \le i \le n-1. $$

Proof. The proof is by mathematical induction on the integer $i$. Obviously, the claim is true for $i = 0$. Suppose that it is true for $0 \le i \le k$; then there must exist a permutation matrix $Q$ such that
$$ \mathrm{rank}([QAQ^{-1}]_{L_{i+1},L_i}) = i \quad \text{for } 0 \le i \le k. $$
We shall prove that it is also true for $0 \le i \le k+1$. Note that
$$ (k+1) \ge \mathrm{rank}([QAQ^{-1}]_{L_{k+2},L_{k+1}}) \ge \mathrm{rank}([QAQ^{-1}]_{L_{k+1},L_{k+1}}) \ge \mathrm{rank}([QAQ^{-1}]_{L_{k+1},L_k}) = k. $$
If $\mathrm{rank}([QAQ^{-1}]_{L_{k+2},L_{k+1}}) \ne k+1$, it must hold that
$$ \mathrm{rank}([QAQ^{-1}]_{L_{k+2},L_{k+1}}) = \mathrm{rank}([QAQ^{-1}]_{L_{k+1},L_{k+1}}) = k. $$
Since $A$ is nonsingular and $Q$ is a permutation matrix, we have
$$ \mathrm{rank}(QAQ^{-1}) = \mathrm{rank}(A) = n. $$
This means the columns of $QAQ^{-1}$ are linearly independent. Hence, we have
$$ \mathrm{rank}([QAQ^{-1}]_{L_n,L_{k+1}}) = k+1. $$
In other words, there must exist a row $j \ge k+1$ that is linearly independent of rows 0 to $k$ in $[QAQ^{-1}]_{L_n,L_{k+1}}$. Let $\tilde{Q}$ be the permutation matrix exchanging rows $j$ and $k+1$. Accordingly, $\tilde{Q}^{-1}$ must be the permutation matrix exchanging columns $j$ and $k+1$. We can derive
$$ \mathrm{rank}([\tilde{Q}(QAQ^{-1})\tilde{Q}^{-1}]_{L_{i+1},L_i}) = i \quad \text{for } 0 \le i \le k, $$
and
$$ \mathrm{rank}([\tilde{Q}(QAQ^{-1})\tilde{Q}^{-1}]_{L_{k+2},L_{k+1}}) = k+1. $$
Since $(\tilde{Q}Q)^{-1} = Q^{-1}\tilde{Q}^{-1}$, it can be derived that
$$ \mathrm{rank}([(\tilde{Q}Q)A(\tilde{Q}Q)^{-1}]_{L_{i+1},L_i}) = i \quad \text{for } 0 \le i \le k+1. $$
Since $\tilde{Q}Q$ is also a permutation matrix, the claim is true for $0 \le i \le k+1$. Therefore, by induction, the theorem is true for $0 \le i \le n-1$. This completes the proof. □

Corollary 1. For any LCP, there exists a reordering mapping such that the new communication has no channel contention.

Following the method in the proof of Theorem 3, we can design an algorithm, as shown in Fig. 3, to find an optimal reordering mapping for any LCP. The input of the algorithm, LCP_Optimizer, is an LCP $y = Ax + b$, and the outputs are the optimal reordering mapping $Q$ and the new LCP $y' = Dx' + d$, where $D = QAQ^{-1}$ and $d = Qb$. At the beginning of LCP_Optimizer, $D$, $d$, and $Q$ are set to $A$, $b$, and $I$, respectively. By appropriately exchanging rows and columns, the desired new LCP can be obtained. An important job of LCP_Optimizer is to find a set of linearly independent rows in some submatrix of $D$. In order to do this efficiently, a matrix $V$ is used to find linearly independent rows in a way similar to Gaussian elimination, and a vector $R$ is used to mark those rows. $R[j]$ set to 1 means that the $j$th row of some submatrix of $D$ is linearly independent, and so is the $j$th row of the corresponding submatrix of $V$. Initially, $V$ is set to $D$ and $R$ is set to 0. Throughout the processing of LCP_Optimizer,
$$ \mathrm{rank}(D_{L_p,L_q}) = \mathrm{rank}(V_{L_p,L_q}) $$
for any $(p, q)$.

The loop from line 5 to line 17 is the main part of LCP_Optimizer. At the beginning of iteration $i$, $D$ is already optimal for the dimensions less than or equal to $i$, i.e.,
$$ \mathrm{rank}(D_{L_{r+1},L_r}) = \mathrm{rank}(V_{L_{r+1},L_r}) = r \quad \text{for } 0 \le r \le i, $$
and will be optimized for dimension $i+1$. In the meantime, the selected linearly independent rows of $V_{L_n,L_i}$ are all in $V_{L_{i+1},L_i}$ and are marked by $R$. Each of those rows contains only one 1, and the other elements are 0. All the other rows in $V_{L_n,L_i}$ are zero rows. Hence, to find a new linearly independent row in $D_{L_n,L_{i+1}}$ and $V_{L_n,L_{i+1}}$, we only need to examine column $i$ of $V_{L_n,L_{i+1}}$ and $R$, as shown at lines 6 and 7. If row $j$ is the newly found independent row and $j > i+1$, it is exchanged with row $i+1$ so that
$$ \mathrm{rank}(D_{L_{i+2},L_{i+1}}) = \mathrm{rank}(V_{L_{i+2},L_{i+1}}) = i+1. $$
The required row and column exchange operations are done from line 8 to line 13. At the end of iteration $i$, all the elements in column $i$ of $V_{L_n,L_{i+1}}$, except the one in the newly found independent row, are cleared to 0 by subtracting the new independent row, as shown at lines 14 and 15. Therefore, $V$ is ready to be used in the next iteration. Since each row or column operation takes $O(n)$ execution time, the complexity of LCP_Optimizer can be shown to be $O(n^3)$.

Fig. 3. The proposed algorithm for finding an optimal reordering mapping for an LCP.
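Since Fig. 3 itself is not reproduced here, the following Python sketch (reusing gf2_rank from the earlier listing) realizes the same construction directly from the proof of Theorem 3. It recomputes submatrix ranks instead of maintaining the paper's $V$ and $R$, so it runs in roughly $O(n^4)$ rather than $O(n^3)$ time, and it assumes the input is an LCP with $T_i(A, b) > 0$ at every dimension:

def lcp_optimizer(A, b):
    """Sketch following the proof of Theorem 3 (not the V/R bookkeeping of
    Fig. 3): returns Q and the new LCP (D, d), D = Q A Q^{-1}, d = Q b.
    Assumes A is nonsingular (an LCP)."""
    n = A.shape[0]
    D, d, Q = A.copy() % 2, b.copy() % 2, np.eye(n, dtype=int)
    for k in range(n - 1):
        # Invariant: rank(D[:i+1, :i]) = i for all i <= k. Now make it k+1.
        if gf2_rank(D[:k + 2, :k + 1]) == k + 1:
            continue
        for j in range(k + 2, n):
            # Is row j independent of rows 0..k over the first k+1 columns?
            if gf2_rank(D[list(range(k + 1)) + [j], :k + 1]) == k + 1:
                s = [k + 1, j]
                D[s] = D[[j, k + 1]]           # exchange rows j and k+1 ...
                D[:, s] = D[:, [j, k + 1]]     # ... and the matching columns
                d[s], Q[s] = d[[j, k + 1]], Q[[j, k + 1]]
                break
    return Q, D, d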

Example 4. Consider the matrix-transpose communication on an 8-dimensional hypercube shown in Example 3. Using the algorithm of Fig. 3, we can compute the optimal reordering mapping and the new LCP as follows:
$$
Q = \begin{bmatrix}
1&0&0&0&0&0&0&0\\
0&0&0&0&1&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&0&0&0&1&0\\
0&1&0&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&1&0&0&0&0\\
0&0&0&0&0&0&0&1
\end{bmatrix}
$$
and
$$
\begin{bmatrix} y'_0 \\ y'_1 \\ y'_2 \\ y'_3 \\ y'_4 \\ y'_5 \\ y'_6 \\ y'_7 \end{bmatrix}
=
\begin{bmatrix}
0&1&0&0&0&0&0&0\\
1&0&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&1&0&0&0\\
0&0&0&0&0&0&0&1\\
0&0&0&0&0&0&1&0
\end{bmatrix}
\begin{bmatrix} x'_0 \\ x'_1 \\ x'_2 \\ x'_3 \\ x'_4 \\ x'_5 \\ x'_6 \\ x'_7 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
$$

The degree of channel contention at each dimension for the new LCP becomes
$$ T_0 = T_1 = T_2 = T_3 = T_4 = T_5 = T_6 = T_7 = 2^0 = 1. $$
Compared with Example 3, the degree of channel contention is greatly reduced from 8 to 1; i.e., the reordering mapping makes the communication contention-free.
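Running the sketch above on the matrix-transpose LCC of Example 3 should reproduce this result (hand-tracing it yields exactly the $Q$ and new LCP displayed above, though ties in the row search could in principle break differently):

# Verifying Example 4 with the sketches above; A is the transpose matrix
# built in the earlier listing, and the assumptions there carry over.
Q, D, d = lcp_optimizer(A, np.zeros(8, dtype=int))
print(contention(D))      # -> [1, 1, 1, 1, 1, 1, 1, 1], i.e., contention-free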

Theorem 4. For any binary matrix $A_{n\times n}$ with $\mathrm{rank}(A_{n\times n}) < n$, there exists a permutation matrix $Q_{n\times n}$ such that
$$ i - \mathrm{rank}([QAQ^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}(A) \quad \text{for } 0 \le i \le n-1. $$

Proof. Since $\mathrm{rank}(A_{n\times n}) < n$, there exists a column $j$ in $A$ that is a linear combination of the other columns. Let $P$ be the permutation matrix exchanging rows $j$ and $n-1$; accordingly, $P^{-1}$ is the permutation matrix exchanging columns $j$ and $n-1$. Let $B = PAP^{-1}$. We have
$$ \mathrm{rank}(B_{L_n,L_{n-1}}) = \mathrm{rank}(B) = \mathrm{rank}(A). $$
In the following, we prove that there exists a permutation matrix $Q$ such that
$$ i - \mathrm{rank}([QBQ^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}(B_{L_n,L_{n-1}}) \quad \text{for } 0 \le i \le n-1. $$
The proof is by mathematical induction on the integer $i$, downward from $i = n-1$. Obviously, the claim is true for $i = n-1$. Suppose that it is true for $k \le i \le n-1$; then there must exist a permutation matrix $Q$ such that
$$ i - \mathrm{rank}([QBQ^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}(B_{L_n,L_{n-1}}) \quad \text{for } k \le i \le n-1. $$
We shall prove that it is also true for $k-1 \le i \le n-1$. Consider $i = k-1$ for the matrix $QBQ^{-1}$. If
$$ (k-1) - \mathrm{rank}([QBQ^{-1}]_{L_k,L_{k-1}}) > (n-1) - \mathrm{rank}(B_{L_n,L_{n-1}}), $$
then it must hold that
$$ (k-1) - \mathrm{rank}([QBQ^{-1}]_{L_k,L_{k-1}}) > k - \mathrm{rank}([QBQ^{-1}]_{L_{k+1},L_k}), $$
that is,
$$ \mathrm{rank}([QBQ^{-1}]_{L_{k+1},L_k}) - 1 > \mathrm{rank}([QBQ^{-1}]_{L_k,L_{k-1}}). $$
So we can derive
$$ \mathrm{rank}([QBQ^{-1}]_{L_k,L_k}) > \mathrm{rank}([QBQ^{-1}]_{L_k,L_{k-1}}) \quad\text{and}\quad k-1 > \mathrm{rank}([QBQ^{-1}]_{L_k,L_{k-1}}). $$
This means that, in $[QBQ^{-1}]_{L_k,L_k}$, column $k-1$ is linearly independent of the other columns, and there must exist a column $j$, $0 \le j \le k-2$, which is a linear combination of the other columns. Let $\tilde{Q}$ be the permutation matrix exchanging rows $j$ and $k-1$; accordingly, $\tilde{Q}^{-1}$ is the permutation matrix exchanging columns $j$ and $k-1$. (Exchanging with position $k-1$ leaves the already-settled dimensions $k \le i \le n-1$ untouched.) We can derive
$$ i - \mathrm{rank}([\tilde{Q}(QBQ^{-1})\tilde{Q}^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}(B_{L_n,L_{n-1}}) \quad \text{for } k \le i \le n-1, $$
and
$$ (k-1) - \mathrm{rank}([\tilde{Q}(QBQ^{-1})\tilde{Q}^{-1}]_{L_k,L_{k-1}}) = (k-1) - \mathrm{rank}([QBQ^{-1}]_{L_k,L_k}) \le k - \mathrm{rank}([QBQ^{-1}]_{L_{k+1},L_k}) \le (n-1) - \mathrm{rank}(B_{L_n,L_{n-1}}). $$
Since $\tilde{Q}Q$ is also a permutation matrix, the inequality is true for $k-1 \le i \le n-1$. Hence, by mathematical induction, there exists a permutation matrix $Q$ such that
$$ i - \mathrm{rank}([QBQ^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}(B_{L_n,L_{n-1}}) \quad \text{for } 0 \le i \le n-1. $$
Therefore,
$$ i - \mathrm{rank}([Q(PAP^{-1})Q^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}([PAP^{-1}]_{L_n,L_{n-1}}) \quad \text{for } 0 \le i \le n-1. $$
In other words,
$$ i - \mathrm{rank}([(QP)A(QP)^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}(A) \quad \text{for } 0 \le i \le n-1. $$
Since $QP$ is also a permutation matrix, this completes the proof of Theorem 4. □

Corollary 2. For any LCG or LCS, the new communication under the reordering mapping of Theorem 4 has minimum channel contention.

Proof. Given an LCG or LCS with binary matrix $A_{n\times n}$, for any reordering mapping $Q$,
$$ \max_{0 \le i \le n-1} \big\{ i - \mathrm{rank}([QAQ^{-1}]_{L_{i+1},L_i}) \big\} \ge (n-1) - \mathrm{rank}(A). $$
From Theorem 4, there exists a permutation matrix $\tilde{Q}$ such that
$$ i - \mathrm{rank}([\tilde{Q}A\tilde{Q}^{-1}]_{L_{i+1},L_i}) \le (n-1) - \mathrm{rank}(A) \quad \text{for } 0 \le i \le n-1. $$
Therefore, by Theorem 4, we can find an optimal reordering mapping such that the new communication has minimum channel contention. □

Similar to LCP_Optimizer, an algorithm for LCS or LCG can easily be derived following the method in the proof of Theorem 4. It is omitted here to save space.

5 PROCESSOR MAPPING FOR A SET OF LCCS

From the previous section, we can find an optimal reordering mapping for any single LCC. However, there is often more than one communication in a parallel program. To execute such a parallel program efficiently, we have to deal with the problem of performing a set of LCCs. An example of a set of LCCs is given below; it shows some frequently used subroutines that may appear in a library for image processing, including image rotation, reflection, FFT, etc.

Example 5. Suppose a $16 \times 16$ image is distributed on an 8-dimensional hypercube such that the pixel at coordinates $(p_x, p_y)$ is on node $p_y \cdot 16 + p_x$. Some frequently used subroutines for processing such an image are as follows:

1. To reflect about the diagonal, each pixel $(p_x, p_y)$ has to be moved to $(p_y, p_x)$, i.e., sent from node $p_y \cdot 16 + p_x$ to node $p_x \cdot 16 + p_y$. The communication on the hypercube is an LCC, $y = A_1x + b_1$, which is the matrix transpose shown in Example 3. Similarly, to reflect about the line $p_y = 15 - p_x$, the required communication is $y = A_1x + [11111111]^t$. To rotate 90 degrees clockwise, the required communication is $y = A_1x + [11110000]^t$; to rotate 90 degrees counterclockwise, it is $y = A_1x + [00001111]^t$.

2. To reflect about a vertical line, each pixel $(p_x, p_y)$ has to be moved to $(15 - p_x, p_y)$. The required communication is $y = Ix + [11110000]^t$. Similarly, to reflect about a horizontal line, the required communication is $y = Ix + [00001111]^t$. To rotate 180 degrees, the required communication is $y = Ix + [11111111]^t$.

3. To scale the lower-left $8 \times 8$ subimage by a factor of 2 along both axes, each pixel $(p_x, p_y)$ in the subimage has to be sent to four new positions: $(2p_x, 2p_y)$, $(2p_x+1, 2p_y)$, $(2p_x, 2p_y+1)$, and $(2p_x+1, 2p_y+1)$. The required communication is an LCS:
$$
\begin{bmatrix}
0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&1&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&1\\
0&0&0&0&0&0&0&0
\end{bmatrix}
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
=
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}.
$$

4. Consider the pixels of the image as 256 discrete data points and perform a 1D FFT on these data. In addition to the neighboring communications, a bit-reverse communication is required [8], [13]. The bit-reverse communication, $y = A_3x + b_3$, is the one shown in Example 1.

5. To perform a 2D FFT on the image, the process is to perform a 1D FFT on each row of the image and then a 1D FFT on each column [13]. For the row FFTs, in addition to the neighboring communications, the bit-reverse communication over the rows is required. Similarly, for the column FFTs, the bit-reverse communication over the columns is required. They are, respectively:
$$
y = \begin{bmatrix}
0&0&0&1&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
1&0&0&0&0&0&0&0\\
0&0&0&0&1&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&1
\end{bmatrix} x
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\quad\text{and}\quad
y = \begin{bmatrix}
1&0&0&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&1&0&0&0&0\\
0&0&0&0&0&0&0&1\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&1&0&0&0
\end{bmatrix} x
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
$$


Since a reordering mapping that is good for some LCCs may be harmful to others, our goal is to find a reordering mapping that is good enough for all the LCCs to be performed in a parallel program. For different applications, the objective function of the optimization may also differ. Let $y = A_rx + b_r$, $0 \le r \le m-1$, be the $m$ LCCs to be performed. In some applications, these LCCs are in different subroutines that are called dynamically. Example 5 shows one such application. Since we cannot determine at compile time which subroutines will be called, or how many times, a reasonable choice is to minimize the maximum channel contention over these subroutines, i.e., to minimize
$$ \max_{0 \le r \le m-1} T(A_r, b_r). $$

In some other applications, the LCCs may have to be performed simultaneously. Such a situation may happen when messages are scatter-gathered among processors. Sometimes, interference between different communications may also lead to this situation. Since those LCCs are performed simultaneously, it is appropriate to minimize
$$ \max_{0 \le i \le n-1} \Big\{ \sum_{0 \le r \le m-1} T_i(A_r, b_r) \Big\}. $$

There are other possible concerns, such as minimizing
$$ \sum_{0 \le i \le n-1} \sum_{0 \le r \le m-1} T_i(A_r, b_r). $$

In this section, we propose an algorithm to find an optimal reordering mapping based on dynamic programming. The concept of dynamic programming can be applied to any of the three objective functions in a similar way. Without loss of generality, we focus on the problem of minimizing
$$ \max_{0 \le r \le m-1} T(A_r, b_r), $$
i.e., finding an optimal reordering mapping $Q$ that minimizes
$$ \max_{0 \le r \le m-1} \Big\{ \max_{0 \le i \le n-1} \big\{ i - \mathrm{rank}([QA_rQ^{-1}]_{L_{i+1},L_i}) \big\} \Big\}. $$

Though this problem could be solved by an exhaustive search over all possible reordering matrices, their number, $n!$, makes such a solution infeasible for a large hypercube computer. Therefore, we present an algorithm based on the technique of dynamic programming to reduce the search space.

Since the matrix $Q$ is a permutation matrix, there is a one-to-one correspondence between reordering mappings and orderings of the address bits. We shall use $R(S)$ to denote an optimal ordering of the address bits in the set $S$. In other words, $R(S)$ defines an optimal reordering mapping for $[A_r]_{S,S}$, $0 \le r \le m-1$. For example, the corresponding order of address bits for the reordering mapping $Q$ in Example 4 is $R(L_8) = (0, 4, 2, 6, 1, 5, 3, 7)$. The following theorem provides the theoretical foundation for applying dynamic programming to reduce the search space.

Theorem 5. There exists an address bit $j$ in $S$ such that $(R(S - \{j\}), j)$ is an optimal ordering of the address bits for $[A_r]_{S,S}$, $0 \le r \le m-1$.

Proof. Let $s$ be the cardinality of $S$ and suppose that $j$ is the last address bit of $R(S)$. The maximum channel contention at dimension $s-1$ must be the same for $R(S)$ and $(R(S - \{j\}), j)$. Since $R(S - \{j\})$ is optimal for $[A_r]_{S-\{j\},S-\{j\}}$, the ordering $(R(S - \{j\}), j)$ is optimal for $[A_r]_{S,S}$ at dimensions 0 to $s-2$. Therefore, $(R(S - \{j\}), j)$ must be an optimal ordering for $[A_r]_{S,S}$. □

According to Theorem 5, we can obtain $R(S)$ as an optimal ordering chosen from $(R(S - \{i\}), i)$ over all $i$ in $S$, and find $R(L_n)$ by computing $R(S)$ for all subsets $S \subseteq L_n$. The computation of $R(S)$, for $S \subseteq L_n$, can be performed in increasing order of the cardinality of $S$. For any $S \subseteq L_n$, the number of candidate orderings searched is equal to its cardinality. Therefore, the total number of searches is
$$ \sum_{k=1}^{n} k \binom{n}{k} = \sum_{k=1}^{n} n \binom{n-1}{k-1} = n \cdot 2^{n-1}. $$

The search space is thus reduced from $n!$ to $n \cdot 2^{n-1}$. Fig. 4 illustrates the computation of $R(L_4)$. Since the running time for each case in the search space is $O(m \cdot n^3)$, the running time for computing $R(L_n)$ is $O(m \cdot n^4 \cdot 2^n)$. Although this running time is not polynomial in the dimension $n$, it is polynomial in the number of processors. Hence, the algorithm is feasible even for a 16-dimensional hypercube.
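The following Python sketch (names ours; gf2_rank is the GF(2) rank routine from the Section 4 listing) renders this dynamic program directly. It visits subsets in order of cardinality and extends each R(S − {j}) by j as Theorem 5 allows, keeping the ordering with the smallest maximum contention; dimensions with T_i = 0 are assumed to have been split off beforehand, as in Section 4:

from itertools import combinations
import numpy as np

def optimal_ordering(mats):
    """R(L_n) for the binary matrices A_r in `mats`, minimizing the maximum
    contention over all LCCs; returns (max contention, ordering)."""
    n = mats[0].shape[0]
    best = {frozenset(): (1, [])}            # subset -> (max contention, ordering)
    for size in range(1, n + 1):
        for S in map(frozenset, combinations(range(n), size)):
            rows = sorted(S)
            cands = []
            for j in S:
                prev_cost, prev_order = best[S - {j}]
                cols = sorted(S - {j})
                # Contention added at dimension size-1 when bit j comes last:
                step = max(2 ** (size - 1 - gf2_rank(A[np.ix_(rows, cols)]))
                           for A in mats)
                cands.append((max(prev_cost, step), prev_order + [j]))
            best[S] = min(cands, key=lambda c: c[0])
    return best[frozenset(range(n))]

# The two LCCs of Example 6: matrix-transpose (A1) and bit-reverse (A3).
A1 = np.block([[np.zeros((4, 4), dtype=int), np.eye(4, dtype=int)],
               [np.eye(4, dtype=int), np.zeros((4, 4), dtype=int)]])
A3 = np.fliplr(np.eye(8, dtype=int))
T, order = optimal_ordering([A1, A3])
Q = np.zeros((8, 8), dtype=int)
Q[np.arange(8), order] = 1                   # x'_i = x_{order[i]}
print(T, order)   # -> 2 and an optimal ordering (ties may differ from Example 6)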

Example 6. Consider the two LCCs $A_1$ and $A_3$ in Example 5, where $A_1$ is the matrix-transpose operation and $A_3$ is the bit-reverse operation on an 8-dimensional hypercube. The proposed dynamic programming approach finds $R(L_8) = (3, 4, 0, 7, 2, 5, 1, 6)$. Hence, we can compute
$$
Q = \begin{bmatrix}
0&0&0&1&0&0&0&0\\
0&0&0&0&1&0&0&0\\
1&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&1\\
0&0&1&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&1&0&0&0&0&0&0\\
0&0&0&0&0&0&1&0
\end{bmatrix},
$$


$$
Q^{-1} = \begin{bmatrix}
0&0&1&0&0&0&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&1&0&0&0\\
1&0&0&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&0&1\\
0&0&0&1&0&0&0&0
\end{bmatrix},
$$
and the two new communications are
$$
\begin{bmatrix} y'_0 \\ y'_1 \\ y'_2 \\ y'_3 \\ y'_4 \\ y'_5 \\ y'_6 \\ y'_7 \end{bmatrix}
=
\begin{bmatrix}
0&0&0&1&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
1&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&1\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&1&0&0&0
\end{bmatrix}
\begin{bmatrix} x'_0 \\ x'_1 \\ x'_2 \\ x'_3 \\ x'_4 \\ x'_5 \\ x'_6 \\ x'_7 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$
and
$$
\begin{bmatrix} y'_0 \\ y'_1 \\ y'_2 \\ y'_3 \\ y'_4 \\ y'_5 \\ y'_6 \\ y'_7 \end{bmatrix}
=
\begin{bmatrix}
0&1&0&0&0&0&0&0\\
1&0&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&1&0&0&0\\
0&0&0&0&0&0&0&1\\
0&0&0&0&0&0&1&0
\end{bmatrix}
\begin{bmatrix} x'_0 \\ x'_1 \\ x'_2 \\ x'_3 \\ x'_4 \\ x'_5 \\ x'_6 \\ x'_7 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
$$

After the processor mapping, the degrees of channel contention become 2 for matrix-transpose and 1 for bit-reverse, instead of 8 for both of the original communications.

6 PERFORMANCE STUDY

To investigate the performance improvement of the proposed approach, experiments were carried out by simulating the network behavior of an 8-dimensional hypercube. In Section 6.1, we compare the e-cube routing and some partially adaptive routing strategies with our approach. These partially adaptive routing strategies include the P-cube routing proposed by Glass and Ni [11], the routing strategy proposed by Chiu et al. [7], the minimal (Min) routing proposed by Li [15], and the MIXab3 and MIXbb3 routing proposed by Chen and Yihng [5]. The simulation setup follows those in [5] and [11]. In Section 6.2, a practical application, a parallel FFT program, is run on our simulator to show the benefit of the proposed approach.

The network architecture assumed in the simulation is as follows: There are 256 nodes connected as an 8-dimensional hypercube. Each node consists of a processor, local memory, a router, and other supporting devices. Between two neighboring routers there are two unidirectional channels, one for each direction. A router can communicate with its local processor through pairs of ports. A separate buffer with a slot for one flit is associated with each channel. When more than one input channel contains a header flit waiting for the same available output channel, the arbitration policy favors the header flit that arrived at the router first. If a header flit in an input channel has more than one available output channel allowed by the routing strategy, the channel with the lowest dimension is selected.

6.1 Simulation for Three Communications

Network performance is significantly affected by the communication patterns, which are application-dependent. In the following discussion, we consider three communications: matrix-transpose, bit-reverse, and reverse-flip. They are chosen not only because they are frequently used in many scientific and engineering applications, but also because they are used as test cases for many routing algorithms, so we can compare our simulation with published results. For matrix-transpose, every node $x = [x_0x_1x_2x_3x_4x_5x_6x_7]^t$ sends messages to node $y = [x_4x_5x_6x_7x_0x_1x_2x_3]^t$. For bit-reverse, node $x = [x_0x_1x_2x_3x_4x_5x_6x_7]^t$ sends messages to node $y = [x_7x_6x_5x_4x_3x_2x_1x_0]^t$. Reverse-flip behaves like bit-reverse except that the address bits of the destination node are complemented, i.e., node $x = [x_0x_1x_2x_3x_4x_5x_6x_7]^t$ sends messages to node $y = [\bar{x}_7\bar{x}_6\bar{x}_5\bar{x}_4\bar{x}_3\bar{x}_2\bar{x}_1\bar{x}_0]^t$. All three communications are LCPs. We can find reordering mappings for them and see how the performance is improved.

In the simulation, processors generate messages at time intervals given by a negative exponential distribution. Each message is assumed to have 20 flits, including the header flit(s). A flit requires one cycle to be transmitted through a channel. The measures of interest in this section are the average message latency and the average sustainable network throughput. The message latency is the number of cycles spent by a message in traveling from its source processor to its destination, taking the queuing delay into account. The average network throughput indicates the average number of flits delivered per cycle per processor. It is sustainable if the number of messages queued at their source processors is small and bounded. For a given system, the average message latency in general grows as the throughput increases. At low throughput, the network latency is contributed mainly by the message length and the distance to travel, because little queuing delay is involved. As the throughput increases, more channel contention and longer queuing delays occur, giving rise to a higher message latency. One system exhibits better communication performance than another if it has a lower message latency for any given throughput.

Figs. 5, 6, and 7 show the simulation results of matrix-transpose, bit-reverse, and reverse-flip, respectively. In these figures, M-S denotes the simulation result using an optimal reordering mapping for the specific communication, which is proved to be contention-free in Section 4, and M-G uses the reordering mapping chosen for the three communications together by the dynamic programming algorithm proposed in Section 5. After the processor mapping M-G, the degree of channel contention for matrix-transpose is 2, and bit-reverse and reverse-flip are contention-free.

In Figs. 5, 6, and 7, it can be observed that the routing strategies MIXab3, MIXbb3, Chiu, and Min indeed improve the performance over the e-cube routing, since the differences in interrouter setup delay and flow control cycle time are not considered in this experiment [6]. The maximum sustainable network throughput of these routing strategies is about 30 percent to 100 percent higher than that of the e-cube routing. Their message latencies are also lower for any given sustainable throughput. The P-cube routing performs quite differently on the three communications: it performs well for reverse-flip but worst for bit-reverse. From these results, it can be observed that it is very difficult for a routing algorithm to perform well for all communication patterns.

It can also be observed that, for any of the three communications, the network throughput of the e-cube routing is always less than 0.125, which is 1/8 of the network capacity. As pointed out in Section 2, this is because the degree of channel contention is 8 under the e-cube routing. In other words, the maximum achievable throughput is approximately inversely proportional to the degree of channel contention.

After applying M-S or M-G, there is no channel contention for the three communications, except for matrix-transpose under M-G, whose degree of channel contention is 2. Theoretically, their throughput can approach 1 and 0.5, respectively. The actual values are somewhat smaller because of the queuing delay between messages generated by the same processor. As shown in Fig. 5, the value for M-G is about 0.3 instead of 0.5. Note also that, for any given sustainable network throughput, the message latencies of M-G and M-S are far lower than those of the traditional approaches.

With these results, it is obvious that our approach can greatly reduce the network latency and significantly improve the throughput for LCC. Furthermore, no extra hardware support or sophisticated routing strategy is needed; only e-cube wormhole routing is assumed in the proposed approach. Therefore, it is of practical use.

6.2 Simulation for FFT

The Fast Fourier Transform (FFT) is one of the most commonly used algorithms in digital signal processing and is widely used in applications such as image processing and spectral analysis. The purpose of this section is to investigate the benefit of the proposed approach for such a practical application.

The Discrete Fourier Transform (DFT) of an $m$-point discrete signal $x(i)$ is defined by
$$ X(k) = \sum_{i=0}^{m-1} x(i)\, W_m^{ik}, \qquad 0 \le k < m, $$
where $W_m = e^{-j2\pi/m}$ and $j = \sqrt{-1}$. Direct DFT computation requires $O(m^2)$ arithmetic operations. A faster method of computing the DFT is the FFT algorithm, which requires only $O(m \lg m)$ arithmetic operations. A more detailed analysis of the FFT can be found in [8]. Fig. 8 shows an example of the flow chart of the FFT algorithm for 16 points. The FFT algorithm begins with a bit-reverse permutation of the inputs, followed by $\lg m$ stages, each consisting of $m/2$ butterfly operations. An FFT input $x$ can be identified by a binary vector $[x_0 x_1 x_2 \ldots x_{\lg m - 1}]^t$. In the $i$th computational stage, the two inputs of a butterfly operation are $[x_0 x_1 \ldots x_{i-1} \ldots x_{\lg m - 1}]^t$ and $[x_0 x_1 \ldots \bar{x}_{i-1} \ldots x_{\lg m - 1}]^t$. We can exploit some properties of the FFT algorithm to produce an efficient parallel algorithm. The Parallel_FFT algorithm is described in the following paragraphs.

Fig. 5. Simulation result of matrix-transpose on an 8-dimensional hypercube.

Fig. 6. Simulation result of bit-reverse on an 8-dimensional hypercube.

Let the number of data points be $m = 2^{n+2d}$, where $n$ is the dimension of the hypercube network and $d$ is a positive integer. Each input
$$ x = [x_0 x_1 \ldots x_d x_{d+1} \ldots x_{n+d-1} x_{n+d} \ldots x_{n+2d-1}]^t $$
is assigned to processor $[x_d x_{d+1} \ldots x_{n+d-1}]^t$. Hence, the $m$ inputs are distributed on the $2^n$ processors with a block-cyclic distribution: there are $2^d$ inputs in each block and $2^d$ blocks in each processor. The Parallel_FFT algorithm follows the $1 + \lg m$ stages of the FFT algorithm. In the first stage, a bit-reverse communication between processors is required to complete the bit-reverse permutation of the inputs. In the following $n + 2d$ computational stages, the first and the last $d$ stages do not require communication operations, and a neighboring communication is needed for each of the other $n$ stages. The reasons are as follows: In the $i$th computational stage, $1 \le i \le d$, the two inputs of a butterfly operation,
$$ [x_0 x_1 \ldots x_{i-1} \ldots x_d \ldots x_{n+2d-1}]^t \quad\text{and}\quad [x_0 x_1 \ldots \bar{x}_{i-1} \ldots x_d \ldots x_{n+2d-1}]^t, $$
are in the same processor, $[x_d x_{d+1} \ldots x_{n+d-1}]^t$. Hence, the data required for computation are all in local memory and no communication is needed. Similarly, no communication is needed in the $i$th computational stage for $n+d+1 \le i \le n+2d$. In computational stage $i$, $d+1 \le i \le n+d$, the two inputs of a butterfly operation,
$$ [x_0 x_1 \ldots x_d \ldots x_{i-1} \ldots x_{n+d-1} \ldots x_{n+2d-1}]^t \quad\text{and}\quad [x_0 x_1 \ldots x_d \ldots \bar{x}_{i-1} \ldots x_{n+d-1} \ldots x_{n+2d-1}]^t, $$
are in processors $[x_d \ldots x_{i-1} \ldots x_{n+d-1}]^t$ and $[x_d \ldots \bar{x}_{i-1} \ldots x_{n+d-1}]^t$, respectively. Therefore, a neighboring communication at dimension $i - (d+1)$ is performed. After these $1 + \lg m$ stages, the FFT outputs are obtained and are distributed on the $2^n$ processors with the same block-cyclic distribution as the inputs. The Parallel_FFT algorithm requires $O((m/2^n)\lg m)$ arithmetic operations and $n + 1$ communication operations for each processor.
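The index arithmetic behind this layout can be sketched as follows (Python; owner and butterfly_partner are our names, and bit x_0 is the least significant bit of an input index, consistent with the paper's vector notation):

def owner(index, n, d):
    """Processor holding FFT input `index`: address bits x_d .. x_(n+d-1)."""
    return (index >> d) & ((1 << n) - 1)

def butterfly_partner(index, stage):
    """Partner input at computational stage `stage` (1-based): flip bit x_(stage-1)."""
    return index ^ (1 << (stage - 1))

n, d = 4, 1                                  # 16 processors, m = 2**(n + 2*d) = 64 inputs
for stage in range(1, n + 2 * d + 1):
    a, b = 0, butterfly_partner(0, stage)
    if owner(a, n, d) == owner(b, n, d):
        print(f"stage {stage}: local")       # stages 1..d and n+d+1..n+2d
    else:
        dim = (owner(a, n, d) ^ owner(b, n, d)).bit_length() - 1
        print(f"stage {stage}: neighbor at dimension {dim}")   # equals stage-(d+1)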

By processor reordering mapping, the channel contention of the bit-reverse communication in the first stage of the Parallel_FFT algorithm can be eliminated, while the neighboring communications stay unchanged. To see the performance improvement of processor mapping for the Parallel_FFT algorithm, we simulated the algorithm on an 8-dimensional hypercube. Each input of the FFT is a complex number consisting of two double-precision floating-point numbers, one for the real part and the other for the imaginary part. The characteristics of the hypercube computer are based on the nCUBE 2. The software latency is about 164 μs for a message. The time required for transmitting one byte through a channel is about 0.57 μs. A butterfly operation requires about 5.12 μs, provided that the value of $W^i$ can be found in a precomputed table. If only half of a butterfly operation is performed in a processor, about 4.47 μs is required.

The simulation results are shown in Table 1. The computation time and neighboring communication time are the same for all the routing strategies and are not changed by processor reordering mapping. The bit-reverse communication time is the most important part of this comparison. We can observe that the performance of the partially adaptive routing strategies is not as good as expected, and even worse than e-cube routing. This may be caused by channel contention between the neighboring communications and the bit-reverse communication, since there is no barrier synchronization in the program. The benefit of processor reordering mapping can easily be observed here, as the bit-reverse communication time is greatly reduced.


Table 2 shows the speedup after applying the processor reordering mapping. The ideal speedup of the bit-reverse communication time is $2^{n/2-1}$ for an $n$-dimensional hypercube, because the degree of contention decreases from $2^{n/2-1}$ to 1. However, the speedup in the simulation results is smaller than the ideal speedup due to the effect of the software latency. When the number of data points $m$ is small, the software latency dominates the communication time. Hence, the performance improvement is not evident. As $m$ increases, the message size also increases, and the effect of channel contention becomes more critical to the communication time. Therefore, the performance improvement of processor mapping becomes more significant as $m$ increases. From Table 2, it can be observed that the speedup of the bit-reverse communication time reaches 6.43 for $m = 2^{14}$. The overall execution time is about 41 percent faster after applying processor mapping.

7 CONCLUSIONS

In this paper, we address the problem of minimizing the maximum number of paths contending for the same channel when performing LCC on e-cube wormhole-routed hypercubes. A new approach, called processor reordering mapping, is proposed to solve this problem. We have proved that, for any LCC, there exists a reordering mapping such that the new communication after processor reordering has minimum channel contention. An $O(n^3)$ algorithm is proposed to find such a mapping for an $n$-dimensional hypercube. For a set of LCCs, an algorithm based on dynamic programming is proposed to search for an optimal reordering mapping. It greatly reduces the search space and thus is feasible even for a large hypercube computer. Simulation results clearly show the significant performance improvement provided by the proposed approach compared with partially adaptive routing strategies. With these results, compiler techniques can be used to reduce the message latency without extra hardware cost.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous referees for their helpful suggestions. This research is supported in part by the National Science Council, Republic of China, under Grant NSC87-2213-E-001-005.

REFERENCES

[1] nCUBE 2 Supercomputers Manual. NCUBE Company, 1990.
[2] Origin2000 Servers Technical Report. Silicon Graphics, Inc., 1998.
[3] S. Abraham and K. Padmanabhan, "Performance of the Direct Binary n-Cube Network for Multiprocessors," IEEE Trans. Computers, vol. 38, no. 7, pp. 1000-1011, July 1989.
[4] R. Boppana and C.S. Raghavendra, "Optimal Self-Routing of Linear-Complement Permutations in Hypercubes," Proc. Fifth Distributed Memory Computing Conf., pp. 800-808, Apr. 1990.
[5] H.L. Chen and H.S. Yihng, "Generalized Wormhole Routing Strategies in Hypercubes," J. Information Science and Eng., vol. 10, pp. 387-341, 1994.
[6] A.A. Chien, "A Cost and Speed Model for k-ary n-cube Wormhole Routers," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 2, pp. 150-162, Feb. 1998.
[7] G.M. Chiu, S. Chalasani, and C.S. Raghavendra, "Flexible, Fault-Tolerant Routing Criteria for Circuit-Switched Hypercubes," Proc. IEEE 11th Int'l Conf. Distributed Computing Systems, pp. 582-589, 1991.
[8] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms. MIT Press, 1990.
[9] W.J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Trans. Computers, vol. 39, no. 6, pp. 775-785, June 1990.
[10] W.J. Dally and C.L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," IEEE Trans. Computers, vol. 36, no. 5, pp. 547-553, May 1987.
[11] C.J. Glass and L.M. Ni, "The Turn Model for Adaptive Routing," J. ACM, vol. 41, no. 5, pp. 874-902, Sept. 1994.
[12] Y. Hou, C.-M. Wang, and L.-H. Hsu, "Optimal Processor Mapping for Linear-Complement Communication on Hypercubes and Their Variations," Technical Report TR-IIS-00-014, Inst. Information Science, Academia Sinica, Taiwan, R.O.C., Nov. 2000. http://www.iis.sinica.edu.tw/LIB/TechReport/tr00014.ps.gz.
[13] R.C. Gonzalez and R.E. Woods, Digital Image Processing. Addison-Wesley, 1992.
[14] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. San Mateo, Calif.: Morgan Kaufmann, 1992.
[15] Q. Li, "Minimum Deadlock-Free Message Routing Restrictions in Binary Hypercubes," J. Parallel and Distributed Computing, vol. 15, no. 2, pp. 153-159, 1992.
[16] F.C. Lin and F.H. Wang, "Minimum Deadlock-Free Message Routing Restrictions in Binary Hypercubes," J. Parallel and Distributed Computing, vol. 29, no. 2, pp. 27-42, 1995.
[17] H. Masuyama, "Algorithms to Realize an Arbitrary BPC Permutation in Chordal Ring Networks and Mesh Connected Networks," IEICE Trans. Inf. Syst. (Japan), vol. E77-D, no. 10, pp. 1118-1129, Oct. 1994.
[18] H. Masuyama, Y. Morita, and E. Masuyama, "A Realization of an Arbitrary BPC Permutation in Hypercube Connected Computer Networks," IEICE Trans. Inf. Syst. (Japan), vol. E78-D, no. 4, pp. 428-435, Apr. 1995.
[19] R.J. McEliece, Finite Fields for Computer Scientists and Engineers. Kluwer Academic, 1987.

TABLE 1
TABLE 2. Speedup After Optimal Reordering Mapping


[20] D. Nassimi and S. Sahni, "An Optimal Routing Algorithm for Mesh-Connected Parallel Computers," J. ACM, vol. 27, no. 1, pp. 6-29, Jan. 1980.
[21] D. Nassimi and S. Sahni, "Optimal BPC Permutations on a Cube Connected SIMD Computer," IEEE Trans. Computers, vol. 31, no. 4, pp. 338-341, Apr. 1982.
[22] L.M. Ni and P.K. McKinley, "A Survey of Wormhole Routing Techniques in Direct Networks," IEEE Computer, vol. 26, no. 2, pp. 62-76, Feb. 1993.
[23] Y. Saad and M.H. Schultz, "Topological Properties of Hypercubes," IEEE Trans. Computers, vol. 37, no. 7, pp. 867-872, July 1988.

Yomin Hou received the BS degree in computer science from National Chiao-Tung University, Taiwan, in 1989, where he is currently pursuing the doctoral degree in the Department of Computer and Information Science. His research interests include interconnection networks and parallel processing.

Chien-Min Wang received the BS and PhD degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1987 and 1991, respectively. He joined the Institute of Information Science, Academia Sinica, Taipei, Taiwan, as an assistant research fellow and is now an associate research fellow. His research interests include parallel compilers, parallel algorithms, parallel computer architectures, and object-oriented technology. He is a member of the IEEE Computer Society.

Chiu-Yu Ku received the BS degree in computer science from National Chiao-Tung University, Taiwan, in 1989, the MS degree in computer science from National Taiwan University in 1991, and the MS degree in computer and information science from The Ohio State University in 1998. Currently, he is working as a software engineer at Avanti Technology Inc., Taiwan. His research interests include computer architecture, parallel computing, and compiling techniques.

Lih-Hsing Hsu received the PhD degree from the State University of New York at Stony Brook. He is a professor in the Department of Computer and Information Science at National Chiao Tung University. His research interests include graph algorithms, interconnection networks, and VLSI.
