A Basis Approach to Loop Parallelization and Synchronization*

Li Liu and Ferng-Ching Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC

Email: f8506013@csman.csie.ntu.edu.tw

Abstract

Loop transformation is a crucial step in parallelizing compilers. We introduce the concept of positive coordinate basis for deriving loop transformations. The basis serves to find proper loop transformations that change the dependence vectors into the desired forms. We demonstrate how this approach can systematically extract maximal outer loop parallelism. Based on the concept, we can also construct a minimal set of synchronization vectors, which are deadlock free, to transform the inner serial loops into doacross loops.

1 Introduction

Because loops usually take a large portion of the computation time of a program, much research focuses on loop parallelization [12]. Most parallelization methods use vectors to represent the dependencies and to do parallelism extraction [8]. The pattern of the dependence vectors decides which loop levels can be parallelized. When the number of dependence vectors is large, it may be difficult and inefficient to extract parallelism by considering all the dependence vectors.

Basic transformations, such as loop skewing and loop interchange [11], can be applied to restructure the loop into a better pattern of dependence vectors, so that more levels of the transformed loop can be parallelized. Most existing strategies lack a systematic way to combine the basic transformations to produce proper loop transformations. For example, the Parafrase-2 system [6] only gives advice on which loops can be interchanged, and so on, so the users must make their own decisions on what final transformations they want. One way to combine transformations is to take statistical measurements [9]. But parallelism may still be extracted improperly.

*This research was supported by the National Science Council of ROC under contract NSC83-0408E002-064.

If the extracted parallelism is large enough for the number of processors, we can allocate the computations without any synchronization in each processor. Otherwise, we have to separate the iterations into more parts, and different parts of iterations must be synchronized to ensure that the dependencies are preserved. We can use static synchronization, such as the wavefront method, or dynamic synchronization to achieve this [11]. Since dynamic synchronization may bring more benefit, we shall try to transform the unparallelized loops into DOACROSS loops. The iterations of a DOACROSS loop are executed in parallel and synchronized by special statements. Because the performance of a DOACROSS loop is significantly affected by the synchronization overhead, it is important to find a good set of synchronization vectors. There are three desirable properties for synchronization vectors. First, the synchronization vectors should be deadlock free. Second, the number of synchronization vectors should be small, to reduce message passing. Third, the components of the synchronization vectors should not be too small. If a processor reaches a wait statement, it is blocked and waits for signals from other processors. If all signals have already been received when a processor reaches a wait statement, it can execute the next statement without any delay. A synchronization vector with larger components will decrease the possibility of blocking.

In this paper, we will use the tool of positive coordinate basis to extract maximal outer loop parallelism and then to minimize the number of synchronization vectors for the inner loops. This pc-basis concept has been used to do loop partitioning by the present authors [4]. If we have a pc-basis for the dependence vectors, the legality of a transformation can be verified on the basis. The time complexity of finding a legal transformation can thus be greatly reduced. Furthermore, the dependence vectors are usually not linearly independent, which causes difficulties when we directly transform them to be in the desired pattern. Since the vectors in a pc-basis are linearly independent, finding a proper transformation becomes easier. When the vectors in a pc-basis are transformed to be in a desired form, so are the dependence vectors.

It is well known that a loop with outer loop parallelism is suitable for parallel computation [12, 6]. The pc-basis can be used to automatically construct unimodular transformations for such requirements by systematically combining loop reversal, loop skewing and loop interchange. Then, for the inner serial loops, we transform them into DOACROSS loops by adding some synchronization statements. The number of vectors in the pc-basis is equal to the rank of the dependence vectors, and we can use them to construct a minimal set of synchronization vectors which are deadlock free.

The rest of the paper is organized as follows. In Section 2, we use a typical example to illustrate our approach. In Section 3, we concentrate on how to construct the pc-basis. Fundamental methods of using basic unimodular transformations to modify vectors are shown in Section 4. We then apply a series of these basic transformations on the pc-basis to extract maximal outer loop parallelism in Section 5. In Section 6, we use the pc-basis to construct a minimal set of synchronization vectors and transform the inner serial loops into DOACROSS loops. These transformation constructions have been coded and added to the loop restructuring part of Parafrase-2.

2 Preliminaries and an Illustration

We assume the depth of a loop nest to be $n$. The set of iterations $I^n = \{(i_1, i_2, \ldots, i_n)^t \mid l_k \le i_k \le u_k \text{ for } 1 \le k \le n\}$, where $l_k$ and $u_k$ are the boundaries of the $k$-th level loop, is called the iteration space. The execution of a statement $S$ in iteration $I_i$ is called the instance of $S$ associated with $I_i$. The set of instances of all the statements in the loop associated with $I_i$ forms a unit of computation, denoted by $C_i$. Let $U = (u_1, u_2, \ldots, u_n)^t$ and $V = (v_1, v_2, \ldots, v_n)^t$. If there is $k$ such that $u_h = v_h$ for $1 \le h < k$ and $u_k < v_k$, we say $U \prec V$. The relation so defined is the lexicographical order. A vector $V$ with $V \succ (0, 0, \ldots, 0)^t$ is said to be lexicographically positive. In the semantics of a sequential language, computations are performed in the lexicographical order of the iterations. Suppose $I_i \prec I_j$ and the computations $C_i$ and $C_j$ access the same variables or array elements; then we say that $C_j$ depends on $C_i$, and $I_j - I_i$ is called a dependence vector. It is clear that every dependence vector is lexicographically positive. Let the number of dependence vectors be $m$. The $n \times m$ matrix $D_{n \times m}$, whose $i$-th column is the $i$-th dependence vector, is called the dependence matrix.
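For concreteness, the lexicographical tests used throughout the paper can be stated in a few lines of code. The following is a small illustrative sketch (ours, not from the paper) in Python, checked against the dependence vectors of the example below:

    def lex_less(u, v):
        # u precedes v iff, at the first position where they differ, u is smaller.
        for a, b in zip(u, v):
            if a != b:
                return a < b
        return False

    def lex_positive(v):
        # v is lexicographically positive iff its first non-zero component is > 0.
        return lex_less((0,) * len(v), v)

    # Every dependence vector must be lexicographically positive:
    assert all(lex_positive(d) for d in [(0, 0, 2, -1), (2, 0, 2, 0), (2, 0, 0, 1)])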

Example 2.1:

DO i = 1 TO 5
  DO j = 1 TO 5
    DO k = 1 TO 5
      DO l = 1 TO 5
        A(i, j, k+2, l-1) = C(i, j, k, l) + B(i, j, k, l);
        A(i, j, k, l) = A(i-2, j, k, l-1) * B(i, j, k, l)
      ENDDO
    ENDDO
  ENDDO
ENDDO

The iteration space of the above loop is $I^4 = \{(i_1, i_2, i_3, i_4)^t \mid 1 \le i_k \le 5 \text{ for } 1 \le k \le 4\}$. The dependence vectors of the loop are $(0,0,2,-1)^t$, $(2,0,2,0)^t$ and $(2,0,0,1)^t$. They form the following dependence matrix with three columns:

$$D_{4 \times 3} = \begin{pmatrix} 0 & 2 & 2 \\ 0 & 0 & 0 \\ 2 & 2 & 0 \\ -1 & 0 & 1 \end{pmatrix}$$

Let $M$ be the unimodular matrix
$$M = \begin{pmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix},$$
which will be constructed in Section 6. When we apply $M$ on the dependence matrix, we get
$$MD_{4 \times 3} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 2 & 2 & 0 \\ 1 & 2 & 1 \end{pmatrix}.$$

The transformed dependence vectors are all lexicographically positive and their first two components are zero. So $M$ is legal and parallelizes the two outer levels of the loop. Since the outer two levels of the transformed loop are DOALL loops, they can be separated into 85 independent parts without any synchronization among them. If the number of processors is not more than 85, we can allocate the iterations without any synchronization to each processor. Otherwise, we must separate the iterations into more parts and the parts must be synchronized to ensure the dependencies. Usually, an iteration is a unit of work assigned to a processor. Thus, the iterations of a loop can be executed in parallel on different processors if their dependence constraints are preserved. Let $D$ be the set of dependence vectors. The iteration $I$ can be executed only after the iterations $I - d_i$, $d_i \in D$, have been executed. So, the iteration $I$ must wait for signals from the iterations $I - d_i$, $d_i \in D$. The vectors $d_i$ in $D$ are called the synchronization vectors.
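The legality of $M$ and the amount of extracted outer parallelism are easy to check mechanically. Below is a small Python sketch (ours, using numpy; not part of the paper) that verifies the properties just stated for the running example:

    import numpy as np

    M = np.array([[1, 0, -1, -2],
                  [0, 1,  0,  0],
                  [0, 0,  1,  0],
                  [0, 0,  1,  1]])
    D = np.array([[ 0, 2, 2],
                  [ 0, 0, 0],
                  [ 2, 2, 0],
                  [-1, 0, 1]])

    MD = M @ D                 # columns: (0,0,2,1), (0,0,2,2), (0,0,0,1)

    def lex_positive(v):
        nz = v[v != 0]
        return len(nz) > 0 and nz[0] > 0

    assert round(abs(np.linalg.det(M))) == 1                  # M is unimodular
    assert all(lex_positive(MD[:, j]) for j in range(MD.shape[1]))  # legal
    assert not MD[:2].any()            # first two rows zero: two outer DOALLs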


For doubly nested loops, there is a dependence decomposition method which produces two synchronization vectors to preserve the dependencies [8]. Each dependence vector can be represented as a linear combination of some basic vectors with positive integer coefficients. The iteration $I$ waits for the iterations $I - w_i$, where $w_i$ is a basic dependence vector. However, the basic dependence vectors do not guarantee the deadlock-free property. When a positive linear combination of some basic dependence vectors is the zero vector, a circular wait is caused and deadlock will happen. Furthermore, the number of synchronization vectors is not minimized.

Consider the inner two loops in the previous example. If we take $(0,0,2,0)^t$ and $(0,0,0,1)^t$ as the synchronization vectors, which will be constructed in Section 6, then
$$(0,0,2,1)^t = (0,0,2,0)^t + (0,0,0,1)^t$$
$$(0,0,2,2)^t = (0,0,2,0)^t + 2 \times (0,0,0,1)^t$$
$$(0,0,0,1)^t = (0,0,0,1)^t$$
Because these two synchronization vectors are linearly independent, no positive linear combination of them can be the zero vector, so deadlock will not happen. Besides, the number of synchronization vectors is equal to the dimension of the subspace spanned by the dependence vectors, so it is minimized.
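The decomposition above can be checked directly. A tiny sketch (ours), with the coefficients read off from the third and fourth components:

    s1, s2 = (0, 0, 2, 0), (0, 0, 0, 1)            # synchronization vectors
    deps = [(0, 0, 2, 1), (0, 0, 2, 2), (0, 0, 0, 1)]  # transformed dependences
    for d in deps:
        a, b = d[2] // 2, d[3]                     # coefficients on s1 and s2
        assert a >= 0 and b >= 0
        assert all(a * x + b * y == z for x, y, z in zip(s1, s2, d))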

We use two statements, WaitSignal and SendSignal, to do synchronization. The statement WaitSignal($I_1$, $I_2$) means to wait for a signal, corresponding to the synchronization vector $I_2 - I_1$, sent from the processor executing iteration $I_1$ to the processor executing iteration $I_2$. The statement SendSignal($I_1$, $I_2$) is defined similarly. The final loop obtained is shown below:

DOALL i = -14 TO 2
  DOALL j = 1 TO 5
    DOACR k = max(1, -9-i) TO min(5, 3-i)
      DOACR l = max((1-i+k)/2, 1+k) TO min((5-i+k)/2, 5+k)
        WaitSignal((i, j, k-2, l), (i, j, k, l));
        WaitSignal((i, j, k, l-1), (i, j, k, l));
        A(i-k+2*l, j, k+2, -k+l-1) = C(i-k+2*l, j, k, -k+l) + B(i-k+2*l, j, k, -k+l);
        A(i-k+2*l, j, k, -k+l) = A(i-k+2*l-2, j, k, -k+l-1) * B(i-k+2*l, j, k, -k+l);
        SendSignal((i, j, k, l), (i, j, k+2, l));
        SendSignal((i, j, k, l), (i, j, k, l+1))
      ENDDO
    ENDDO
  ENDDO
ENDDO

We now outline our approach. First, we find a unimodular transformation that makes as many upper rows of the transformed dependence matrix zero as possible. The transformed loop will have maximal outer loop parallelism. Second, we find another unimodular transformation that makes the non-zero entries of the transformed dependence matrix positive; the resulting dependence vectors are used to construct a set of synchronization vectors. Last, we combine these two transformations to form a transformation to be applied on the original loop.

3 Positive Coordinate Basis

Let $M$ be a linear transformation on the iteration space. It should not violate the dependence constraints of the original loop. So $MI_i \prec MI_j$, and hence $M(I_j - I_i) \succ 0$, if $C_j$ depends on $C_i$. Therefore, a transformation is legal if all the dependence vectors are transformed into lexicographically positive vectors.

Definition 3.1: Let $D = \{d_i \mid 1 \le i \le m\}$ be the set of dependence vectors with rank $r$. The set of linearly independent vectors $B = \{b_j \mid 1 \le j \le r\}$ is a positive coordinate basis (pc-basis) for $D$ if for all $d_i \in D$, $d_i = \sum_{j=1}^{r} c_{i,j} b_j$, where $c_{i,j} \ge 0$.

Lemma 3.2: Let $B = \{b_j \mid 1 \le j \le r\}$ be a positive coordinate basis for the set of dependence vectors $D = \{d_i \mid 1 \le i \le m\}$. A transformation $M$ is legal if $Mb_j \succ 0$ for all $b_j \in B$.

Proof: For each $d_i$ in $D$, $Md_i = M \sum_{j=1}^{r} c_{i,j} b_j = \sum_{j=1}^{r} c_{i,j} Mb_j$, where $c_{i,j} \ge 0$. It is clear that $Md_i \succ 0$. □

From Lemma 3.2, the legality of $M$ can be justified if all $Mb_j$, $1 \le j \le r$, are lexicographically positive. In the following, we will show that for a set of lexicographically positive vectors, a pc-basis always exists and its vectors are also lexicographically positive. The construction of a pc-basis is an iterative process. It finds a pc-basis for the first $k-1$ vectors, which is then either "extended" or "modified" to be a pc-basis for the first $k$ vectors.

Algorithm 3.3: (Construct a pc-basis for a lexicographically positive vector set $D$)

B = {d_1};
FOR k = 2 TO |D|
  IF ( d_k is not in Span(B) ) THEN
    B = B ∪ {d_k}
  ELSE
    B = Change_Basis(B, d_k)
  ENDIF
ENDFOR

When $d_k$ is not in the space spanned by the current basis, we simply add $d_k$ into the basis. If $d_k$ is in the space spanned by the current basis, we use Change_Basis to modify the current basis.

Lemma 3.4: Let $B = \{b_1, b_2, \ldots, b_r\}$ be a pc-basis for $\{d_1, d_2, \ldots, d_{k-1}\}$. If $d_k$ is not in the space spanned by $B$, then $B \cup \{d_k\}$ is a pc-basis for $\{d_1, d_2, \ldots, d_k\}$.


Proof: For $1 \le i \le k-1$, $d_i = \sum_{j=1}^{r} c_{i,j} b_j + 0 \cdot d_k$, and $d_k = \sum_{j=1}^{r} 0 \cdot b_j + 1 \cdot d_k$, so $B \cup \{d_k\}$ is a pc-basis for $\{d_1, d_2, \ldots, d_k\}$. □

Change_Basis first tests whether the coordinates of $d_k$ with respect to the current basis are all non-negative. If there is any negative coordinate, it finds a lexicographically positive vector to replace a vector in the basis, until all coordinates become non-negative.

Algorithm 3.5: (Change_Basis($B$, $d_k$))

Solve the linear system B_{n×r} c = d_k;
WHILE ( there exists c_i < 0 )
  Find c_p > 0;
  IF ( c_p b_p + c_i b_i is lexicographically positive ) THEN
    b_p = c_p b_p + c_i b_i;  c_p = 1;  c_i = 0
  ELSE
    b_i = -(c_p b_p + c_i b_i);  c_p = 0;  c_i = -1
  ENDIF
ENDWHILE

The body of the while loop in Algorithm 3.5 changes the current basis such that the modified one is still a pc-basis for $\{d_1, d_2, \ldots, d_{k-1}\}$ and the number of non-zero coordinates of $d_k$ is decreased. The next lemma gives the condition for the modified basis to be valid.

Lemma 3.6: Let $B$ be a pc-basis for $D$. If another basis $B'$ spans the same space and, for all $b_i \in B$, $b_i = \sum_{j=1}^{r} c'_{i,j} b'_j$, where $c'_{i,j} \ge 0$, then $B'$ is also a pc-basis for $D$.

Proof: For all $d_k \in D$, $d_k = \sum_{i=1}^{r} c_{k,i} b_i$, where $c_{k,i} \ge 0$. Then $d_k = \sum_{i=1}^{r} c_{k,i} b_i = \sum_{i=1}^{r} c_{k,i} (\sum_{j=1}^{r} c'_{i,j} b'_j) = \sum_{j=1}^{r} (\sum_{i=1}^{r} c_{k,i} c'_{i,j}) b'_j$, where $\sum_{i=1}^{r} c_{k,i} c'_{i,j} \ge 0$. Therefore $B'$ is also a pc-basis for $D$. □

The positive coefficient $c_p$ in Algorithm 3.5 always exists because the vectors in the pc-basis and $d_k$ are all lexicographically positive. If $c_p b_p + c_i b_i$ is lexicographically positive, we use it to replace $b_p$; otherwise, we use $-(c_p b_p + c_i b_i)$ to replace $b_i$. Such a modification satisfies the precondition of Lemma 3.6 and decreases the number of non-zero coefficients.

Lemma 3.7: Let the lexicographically positive vector set $B = \{b_1, b_2, \ldots, b_r\}$ be a pc-basis for the lexicographically positive vector set $D = \{d_1, d_2, \ldots, d_{k-1}\}$, and let the lexicographically positive vector $d_k$ be in $Span(B)$. Then $B' = \text{Change\_Basis}(B, d_k) = \{b'_1, b'_2, \ldots, b'_r\}$ is a pc-basis for $D \cup \{d_k\}$, where the $b'_i$, $1 \le i \le r$, are all lexicographically positive.

Proof: Because $d_k \in Span(B)$, $d_k = \sum_{j=1}^{r} c_j b_j$. We make induction on the number of non-zero $c_j$'s. When the number of non-zero coordinates is 1, $d_k = c_j b_j$ for some $j$. By the positivity of $d_k$ and $b_j$, $c_j$ is positive. So $B$ is a pc-basis for $D \cup \{d_k\}$, and we can simply take $B'$ as $B$.

Assume the lemma holds when the number of non-zero $c_j$'s is $t$. Suppose now the number of non-zero coefficients is $t+1$. If $c_j \ge 0$ for all $1 \le j \le r$, we can just take $B'$ as $B$. So we may assume that there exists some $c_i < 0$. Since the vectors in the current pc-basis and $d_k$ are all lexicographically positive, there must be some $c_p > 0$, $p \ne i$.

Case 1: $c_p b_p + c_i b_i$ is lexicographically positive. Let $b''_p = c_p b_p + c_i b_i$ and $b''_m = b_m$ for $1 \le m \le r$, $m \ne p$. So $b_p = (b''_p - c_i b''_i)/c_p$ and $b_m = b''_m$ for $1 \le m \le r$, $m \ne p$. We have $d_k = \sum_{j \ne p,i} c_j b''_j + b''_p$. The number of non-zero coordinates of $d_k$ in $B''$ is $t$.

Case 2: $c_p b_p + c_i b_i$ is not lexicographically positive. Let $b''_i = (-c_p) b_p + (-c_i) b_i$ and $b''_m = b_m$ for $1 \le m \le r$, $m \ne i$. So $b_i = (b''_i + c_p b''_p)/(-c_i)$ and $b_m = b''_m$ for $1 \le m \le r$, $m \ne i$. We have $d_k = \sum_{j \ne p,i} c_j b''_j - b''_i$. The number of non-zero coordinates of $d_k$ in $B''$ is $t$.

According to Lemma 3.6, the new basis $B'' = \{b''_m \mid 1 \le m \le r\}$ is also a pc-basis for $D$, and the number of non-zero coordinates of $d_k$ in $B''$ is $t$. By the induction hypothesis, there exists $B'$ which is a pc-basis for $D \cup \{d_k\}$. □

Example 3.8: (Find a pc-basis for Example 2.1)

$d_1 = (0,0,2,-1)^t$, $d_2 = (2,0,2,0)^t$ and $d_3 = (2,0,0,1)^t$.

0. A pc-basis for $\{(0,0,2,-1)^t\}$: $\{(0,0,2,-1)^t\}$.
1. A pc-basis for $\{(0,0,2,-1)^t, (2,0,2,0)^t\}$: $\{(0,0,2,-1)^t, (2,0,2,0)^t\}$.
2. A pc-basis for $\{(0,0,2,-1)^t, (2,0,2,0)^t, (2,0,0,1)^t\}$: $\{(0,0,2,-1)^t, (2,0,0,1)^t\}$.

In the above example, we initialize the pc-basis to be $\{d_1\}$, $b_1 = d_1$. In iteration 1, we just add $d_2$ ($= b_2$) into the basis because $d_2$ is not in the space spanned by $\{d_1\}$. In iteration 2, because $d_3 = (2,0,0,1)^t = d_2 - d_1$ is in the space spanned by $\{b_1, b_2\}$, we use Change_Basis to modify the basis. Here, the coordinates are $1$ and $-1$. Since $b_2 - b_1$ is lexicographically positive, we select $b'_2 = (2,0,0,1)^t = b_2 - b_1$ and $b'_1 = b_1$. So, $b_1 = b'_1$ and $b_2 = b'_1 + b'_2$. By Lemma 3.6, we know that $\{b'_1, b'_2\}$ is also a pc-basis for $\{d_1, d_2\}$. Since $d_3 = b'_2$, $\{b'_1, b'_2\}$ is a pc-basis for $\{d_1, d_2, d_3\}$.
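To make the construction concrete, here is a compact Python sketch (our rendering, not the authors' code; the function names are ours) of Algorithms 3.3 and 3.5, using exact rational arithmetic. Running it on the vectors of Example 3.8 reproduces the pc-basis derived above.

    from fractions import Fraction

    def lex_positive(v):
        # First non-zero entry must be positive.
        for x in v:
            if x != 0:
                return x > 0
        return False

    def coordinates(B, d):
        # Solve sum_j c_j * B[j] = d by Gaussian elimination on the augmented
        # system; return the coordinates, or None if d is not in Span(B).
        r, n = len(B), len(d)
        A = [[Fraction(B[j][i]) for j in range(r)] + [Fraction(d[i])]
             for i in range(n)]
        pivots, row = [], 0
        for col in range(r):
            piv = next((k for k in range(row, n) if A[k][col] != 0), None)
            if piv is None:
                continue
            A[row], A[piv] = A[piv], A[row]
            A[row] = [x / A[row][col] for x in A[row]]
            for k in range(n):
                if k != row and A[k][col] != 0:
                    A[k] = [a - A[k][col] * b for a, b in zip(A[k], A[row])]
            pivots.append(col)
            row += 1
        if any(all(x == 0 for x in A[k][:r]) and A[k][r] != 0 for k in range(n)):
            return None                       # inconsistent: d not in Span(B)
        c = [Fraction(0)] * r
        for k, col in enumerate(pivots):
            c[col] = A[k][r]
        return c

    def change_basis(B, d):
        # Algorithm 3.5: repair the basis until all coordinates of d are >= 0.
        B = [list(map(Fraction, b)) for b in B]
        c = coordinates(B, d)
        while any(x < 0 for x in c):
            i = next(k for k, x in enumerate(c) if x < 0)
            p = next(k for k, x in enumerate(c) if x > 0 and k != i)
            combo = [c[p] * bp + c[i] * bi for bp, bi in zip(B[p], B[i])]
            if lex_positive(combo):
                B[p] = combo                  # replace b_p by c_p*b_p + c_i*b_i
            else:
                B[i] = [-x for x in combo]    # replace b_i by -(c_p*b_p + c_i*b_i)
            c = coordinates(B, d)
        return B

    def pc_basis(D):
        # Algorithm 3.3: extend the basis or repair it, one vector at a time.
        B = [list(map(Fraction, D[0]))]
        for d in D[1:]:
            if coordinates(B, d) is None:
                B.append(list(map(Fraction, d)))
            else:
                B = change_basis(B, d)
        return B

    D = [(0, 0, 2, -1), (2, 0, 2, 0), (2, 0, 0, 1)]
    print(pc_basis(D))    # expected: (0,0,2,-1) and (2,0,0,1), as Fractions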

4 Basic Unimodular Transformations

A linear transformation is called unimodular if its corresponding matrix is integral and has determinant 1 or $-1$ [1]. Let $M$ be a linear transformation. If $I'$ is an iteration within the boundaries of the transformed loop indices, then $M^{-1} I'$ should be in the original iteration space. (Otherwise, $I'$ would do redundant work, causing side effects.) Unimodular transformations match this requirement because their inverses are integral. Our main idea is to apply a series of elementary row operations on the dependence matrix to put the transformed dependence vectors in a suitable form.

Definition 4.1: There are three row operations to be used:

Reversal: $R_{p,c}(A)_{p,j} = cA_{p,j}$, where $c = -1$, and $R_{p,c}(A)_{i,j} = A_{i,j}$ for $i \ne p$.

Skewing: $S_{p,q,c}(A)_{p,j} = A_{p,j} + cA_{q,j}$, and $S_{p,q,c}(A)_{i,j} = A_{i,j}$ for $i \ne p$.

Interchange: $I_{p,q}(A)_{p,j} = A_{q,j}$, $I_{p,q}(A)_{q,j} = A_{p,j}$, and $I_{p,q}(A)_{i,j} = A_{i,j}$ for $i \ne p, q$.

These row operations provide three basic unimodular transformations, which actually produce the three well-known loop transformations: loop reversal, loop skewing and loop interchange. We know that if the $i$-th components of the dependence vectors are all zero, the $i$-th level of the nested loop can be executed simultaneously. We will focus on how to use the basic unimodular transformations to annihilate as many non-zero components of vectors as possible. The annihilating process for a two-dimensional vector is equivalent to the Euclidean algorithm for finding the greatest common divisor [3].
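As an illustration (ours, not the paper's), the three row operations are one-liners on an integer matrix, and each has determinant $\pm 1$, hence is unimodular:

    import numpy as np

    def reversal(A, p):          # R_p: negate row p
        A = A.copy(); A[p] = -A[p]; return A

    def skew(A, p, q, c):        # S_{p,q,c}: add c times row q to row p
        A = A.copy(); A[p] = A[p] + c * A[q]; return A

    def interchange(A, p, q):    # I_{p,q}: swap rows p and q
        A = A.copy(); A[[p, q]] = A[[q, p]]; return A

    # Determinants are -1, +1 and -1 respectively: all unimodular.
    A = np.eye(3, dtype=int)
    for B in (reversal(A, 0), skew(A, 0, 1, 5), interchange(A, 0, 1)):
        assert round(abs(np.linalg.det(B))) == 1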

Algorithm 4.2: (Use the Euclidean algorithm to annihilate the second component of a two-dimensional vector $v$)

M = I;
WHILE ( v_2 ≠ 0 )
  c = v_1 DIV v_2;
  Apply S_{1,2,-c} on v and M;
  Apply I_{1,2} on v and M
ENDWHILE

For the more general case, we must find a unimodular matrix of higher dimension than two. The methods proposed in [3] use many vector operations. Our strategy is to work on scalars rather than vectors.

Algorithm 4.3: (Annihilate the first $n-1$ components of an $n$-dimensional vector $v$)

Use Algorithm 4.2 (on components n-1 and n) to find a unimodular matrix M that annihilates the n-th component of v;
Find c_i, 1 ≤ i ≤ n-1, such that Σ_{i=1}^{n-1} c_i v_i = gcd{v_i | 1 ≤ i ≤ n-1};
Apply S_{n,i,c_i}, 1 ≤ i ≤ n-1, on v and M;
Apply S_{i,n,-v_i/v_n}, 1 ≤ i ≤ n-1, on v and M

Example 4.4: (Transform an $n$-dimensional vector)

1. Make the $n$-th component zero.
2. Make the $n$-th component the gcd of the first $n-1$ components.
3. Make the first $n-1$ components zero.

So far, we have shown how to use basic unimodular transformations to modify an $n$-dimensional vector to be zero in $n-1$ components. In the following section, we will make the dependence matrix have as many zero rows as possible.
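The following Python sketch renders Algorithms 4.2 and 4.3 in executable form (0-based indices; the names are ours). Instead of the paper's two-phase gcd assembly, it folds each leading component into the pivot position by repeated pairwise Euclidean steps; the net effect, a unimodular $M$ leaving a single non-zero entry equal to the gcd, is the same.

    import numpy as np

    def annihilate_pair(v, M, p, q):
        # Euclidean algorithm on components p and q (Algorithm 4.2); afterwards
        # v[q] == 0 and v[p] holds +/- the gcd of the original pair.
        while v[q] != 0:
            c = v[p] // v[q]
            v[p] -= c * v[q]; M[p] -= c * M[q]              # S_{p,q,-c}
            v[[p, q]] = v[[q, p]]; M[[p, q]] = M[[q, p]]    # I_{p,q}
        return v, M

    def annihilate_prefix(v, M, last):
        # Zero v[0..last-1], leaving the gcd at position `last`; rows below
        # `last` are never touched.
        for k in range(last):
            v, M = annihilate_pair(v, M, last, k)   # fold v[k] into v[last]
        if v[last] < 0:
            v[last] = -v[last]; M[last] = -M[last]  # R_last: make the pivot > 0
        return v, M

    v = np.array([0, 0, 2, -1]); M = np.eye(4, dtype=int)
    v, M = annihilate_prefix(v, M, 3)
    print(v, round(abs(np.linalg.det(M))))          # [0 0 0 1] 1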

5 Outer Loop Parallelism Extraction

For outer loop parallelism, we hope to move the zero components upward and preserve the positivity in the meantime [12]. That is, each transformed dependence vector will be of the form $(0, 0, \ldots, 0, d_{n-r+1}, d_{n-r+2}, \ldots, d_n)^t$ and lexicographically positive. This process internalizes the dependencies inward.

Lemma 5.1: Let $B$ be a pc-basis for $D$. If the transformation $M$ internalizes $B$ into the innermost $r$ levels, then $M$ also internalizes the dependence vectors into the innermost $r$ levels.

Proof: Because $M$ internalizes $B$ into the innermost $r$ levels, for any $b_j$ in $B$, $Mb_j$ is lexicographically positive and its first $n-r$ components are 0. For any $d_i$ in $D$, $Md_i = M \sum_{j=1}^{r} c_{i,j} b_j = \sum_{j=1}^{r} c_{i,j} Mb_j$. This is a positive sum of lexicographically positive vectors, hence it is also lexicographically positive. For $1 \le k \le n-r$, $(Md_i)_k = (M \sum_{j=1}^{r} c_{i,j} b_j)_k = \sum_{j=1}^{r} c_{i,j} (Mb_j)_k = 0$. So the first $n-r$ components of $Md_i$ are 0. □

Consider Example 2.1 and Example 3.8 again. We first find a pc-basis for the column vectors of $D_{n \times m}$, $D_{n \times m} = B_{n \times r} C_{r \times m}$. Then, we find a unimodular transformation $M$ to internalize $B_{n \times r}$:

$$MB_{4 \times 2} = \begin{pmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 2 \\ 0 & 0 \\ 2 & 0 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 2 & 0 \\ -1 & 1 \end{pmatrix}$$

$M$ also internalizes the dependence matrix:

$$MD_{4 \times 3} = \begin{pmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0 & 2 & 2 \\ 0 & 0 & 0 \\ 2 & 2 & 0 \\ -1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 2 & 2 & 0 \\ -1 & 0 & 1 \end{pmatrix}$$

We now show how to find the optimal unimodular transformation by way of the pc-basis. The idea is to apply Algorithm 4.3 to the vectors in the pc-basis one by one. The dimension of the vectors to be processed must be decreased to ensure that the transformed vectors are not altered again.

Algorithm 5.2: (Construct a unimodular transformation $M$ to annihilate the first $n-r$ components of $r$ linearly independent vectors $b_1, b_2, \ldots, b_r$)

M = I;
FOR i = r TO 1
  last = n - r + i;  v = M b_i;
  Annihilate the last-th component of v (Algorithm 4.2 on components last-1 and last);
  Find c_k such that Σ_{k=1}^{last-1} c_k v_k = gcd{v_k | 1 ≤ k ≤ last-1};
  Apply S_{last,k,c_k}, 1 ≤ k ≤ last-1, on v and M;
  Apply S_{k,last,-v_k/v_last}, 1 ≤ k ≤ last-1, on v and M
ENDFOR


Theorem 5.3: Let $B_{n \times r}$ be an integral matrix whose column vectors are linearly independent. There exists a unimodular matrix $M$ such that the first $n-r$ row vectors of $MB_{n \times r}$ are zero and the last $r$ row vectors of $MB_{n \times r}$ form an $r \times r$ lower triangular matrix with positive diagonal elements.

Example 5.4: (Transform n-dimensional vectors)

1. Make the first 3 components of the second column vector zero by applying row operations on the first 4 rows; the second column becomes $(0,0,0,1)^t$.

2. Make the first 2 components of the first column vector zero by applying row operations on the first 3 rows; the first column becomes $(0,0,2,-1)^t$.

The result is the matrix $MB_{4 \times 2}$ shown above.

With Algorithm 5.2, we can find a unimodular transformation to internalize the pc-basis into the innermost $r$ levels. By Lemma 5.1, the transformation also internalizes the dependence vectors, producing $n-r$ (which is maximal) outer parallel loops. The time complexity is $O(rn^2)$.
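A compact sketch (ours) of the internalization, processing the pc-basis columns from last to first as in Algorithm 5.2 but using pairwise Euclidean steps in place of the gcd-coefficient skews; on the pc-basis of Example 3.8 it reproduces the matrices $MB_{4 \times 2}$ and $MD_{4 \times 3}$ shown above.

    import numpy as np

    def internalize(B):
        # Process columns r-1, ..., 0 so that column i keeps its pivot at row
        # n-r+i and all rows above it are zeroed.
        n, r = B.shape
        B = B.copy(); M = np.eye(n, dtype=int)
        for i in range(r - 1, -1, -1):
            last = n - r + i
            for k in range(last):                 # fold rows 0..last-1 of column i
                while B[k, i] != 0:               # pairwise Euclidean steps
                    c = B[last, i] // B[k, i]
                    B[last] -= c * B[k]; M[last] -= c * M[k]              # S_{last,k,-c}
                    B[[last, k]] = B[[k, last]]; M[[last, k]] = M[[k, last]]  # I_{last,k}
            if B[last, i] < 0:                    # R_last: make the diagonal positive
                B[last] = -B[last]; M[last] = -M[last]
        return B, M

    B = np.array([[0, 2], [0, 0], [2, 0], [-1, 1]])              # pc-basis of Example 3.8
    D = np.array([[0, 2, 2], [0, 0, 0], [2, 2, 0], [-1, 0, 1]])
    MB, M = internalize(B)
    print(MB)      # rows: (0,0), (0,0), (2,0), (-1,1)
    print(M @ D)   # rows: (0,0,0), (0,0,0), (2,2,0), (-1,0,1)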

Kumar et al. [3] considered the construction of a legal unimodular transformation for a set of vectors which may not be linearly independent. The transformations found there may not be optimal. An algorithm for finding optimal unimodular transformations has been proposed recently in [10]. They use all the dependence vectors one by one to construct a valid transformation. Their algorithm has high time complexity and is for internalization only. We use the pc-basis to reduce the time complexity of internalization, and we intend to systematize more processes of producing valid transformations for more purposes, like the one presented in the following section.

6 PC-Basis for Synchronization Vectors

For constructing a set of synchronization vectors with the good properties mentioned in Section 1, we introduce a special pc-basis as follows.

Definition 6.1: Let $D = \{d_i \mid 1 \le i \le m\}$ be the set of dependence vectors with rank $r$. The set of linearly independent vectors $B = \{b_j \mid 1 \le j \le r\}$ is a positive integer coordinate basis (pic-basis) for $D$ if for all $d_i \in D$, $d_i = \sum_{j=1}^{r} c_{i,j} b_j$, where the $c_{i,j}$, $1 \le j \le r$, are non-negative integers.

Since the vectors in a pic-basis are linearly independent, no positive linear combination of them can be the zero vector. Hence, we can take a pic-basis as the set of synchronization vectors and deadlock will never happen. Besides, the number of synchronization vectors is minimized because it is equal to the dimension of the space spanned by the dependence vectors. It is in general hard to find a pic-basis, but if the dependence vectors span the iteration space and all of their components are non-negative, a pic-basis is easy to construct.

Lemma 6.2: Let $D = \{d_i \mid 1 \le i \le m\}$ be a set of $r$-dimensional vectors with rank $r$ whose components are all non-negative. Let $g_j = \gcd\{(d_i)_j \mid 1 \le i \le m\}$ and let $e_j$ be the $j$-th $r$-dimensional standard basis vector, $1 \le j \le r$. Then the set $B = \{b_j = g_j e_j \mid 1 \le j \le r\}$ is a pic-basis for $D$.

Proof: Since the rank of $D$ is $r$ and all components of vectors in $D$ are non-negative, $g_j > 0$ for all $1 \le j \le r$. We have $d_i = \sum_{j=1}^{r} ((d_i)_j / g_j) g_j e_j = \sum_{j=1}^{r} ((d_i)_j / g_j) b_j$, where the $(d_i)_j / g_j$, $1 \le i \le m$ and $1 \le j \le r$, are all non-negative integers. Since every vector in $D$ can be represented as a linear combination of the vectors in $B$ with non-negative integer coefficients and the number of vectors in $B$ is equal to the rank of $D$, $B$ is a pic-basis for $D$. □

After we apply Algorithm 5.2 to internalize the dependence vectors, the first $n-r$ components of the transformed dependence vectors and of the transformed pc-basis all become zero. We may focus on their last $r$ components only. If we can find another unimodular transformation that makes the components of the transformed dependence vectors all non-negative, the pic-basis can consequently be constructed by Lemma 6.2.
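Once the transformed dependence vectors are non-negative, Lemma 6.2 makes the pic-basis a gcd computation per row. A small sketch (ours) on the running example:

    import numpy as np
    from math import gcd
    from functools import reduce

    MD = np.array([[0, 0, 0],
                   [0, 0, 0],
                   [2, 2, 0],
                   [1, 2, 1]])         # transformed dependence matrix of Example 2.1
    rows = [2, 3]                      # the last r = 2 (non-zero) rows
    g = [reduce(gcd, map(int, MD[i])) for i in rows]        # g = [2, 1]
    sync = [g[k] * np.eye(4, dtype=int)[i] for k, i in enumerate(rows)]
    print(sync)   # (0,0,2,0) and (0,0,0,1): the synchronization vectors of Section 2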

Lemma 6.3: Let $B$ be a pc-basis for $D$. If the elements of $M_{n \times n} B_{n \times r}$ are non-negative, then the elements of $M_{n \times n} D_{n \times m}$ are all non-negative.

Proof: Since $B$ is a pc-basis for $D$, $D_{n \times m} = B_{n \times r} C_{r \times m}$, where the entries of $C_{r \times m}$ are non-negative. $(M_{n \times n} D_{n \times m})_{i,j} = (M_{n \times n} B_{n \times r} C_{r \times m})_{i,j} = \sum_{k=1}^{r} (M_{n \times n} B_{n \times r})_{i,k} (C_{r \times m})_{k,j} \ge 0$. □

By Lemma 6.3, we can just use the transformed pc-basis to construct a unimodular transformation that makes all the components of the transformed dependence vectors non-negative. The following algorithm finds such a unimodular transformation. It scans the components of the pc-basis dimension by dimension, decreasing the number of negative entries until all become non-negative. Recall that Algorithm 5.2 transforms the pc-basis into a lower triangular matrix with positive diagonal and $n-r$ zero rows on top. So, we just do basic unimodular transformations on the last $r$ rows.

Algorithm 6.4: (Construct a unimodular transformation $M_{n \times n}$ such that the elements of $M_{n \times n} B_{n \times r}$ are non-negative)

M_{n×n} = I;
FOR i = n-r+1 TO n
  FOR j = 1 TO i-(n-r)-1
    IF ( b_{i,j} < 0 ) THEN
      multiplier = ⌊-b_{i,j} / b_{n-r+j,j}⌋ + 1;
      Apply S_{i,n-r+j,multiplier} on M_{n×n} and B_{n×r}
    ENDIF
  ENDFOR
ENDFOR
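A small sketch (ours) of Algorithm 6.4; for safety it scans the columns of each row from right to left, so a repaired entry is never disturbed by a later skew:

    import numpy as np

    def make_nonnegative(B, r):
        n = B.shape[0]
        B = B.copy(); M = np.eye(n, dtype=int)
        for i in range(n - r, n):                     # rows of the non-zero block
            for j in range(i - (n - r) - 1, -1, -1):  # columns left of the diagonal
                p = n - r + j                         # pivot row: positive diagonal of column j
                if B[i, j] < 0:
                    mult = (-B[i, j]) // B[p, j] + 1
                    B[i] += mult * B[p]; M[i] += mult * M[p]   # skewing S_{i,p,mult}
        return B, M

    B = np.array([[0, 0], [0, 0], [2, 0], [-1, 1]])   # internalized pc-basis
    B2, M2 = make_nonnegative(B, 2)
    print(B2)   # rows (0,0), (0,0), (2,0), (1,1): all entries non-negative
    print(M2)   # applies the skewing S_{4,3,1} of Example 6.5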

Example 6.5:

The third row contains no negative entries, so nothing is done. For the fourth row, $b_{4,1} = -1 < 0$, so we apply the skewing $S_{4,3,1}$ (multiplier $= \lfloor 1/2 \rfloor + 1 = 1$):

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 2 & 0 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 2 & 0 \\ 1 & 1 \end{pmatrix}$$

The product of the unimodular matrices obtained from Algorithm 5.2 and Algorithm 6.4 is the final transformation we desire. For example, the transformation matrix for Example 2.1 is

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$$

When we apply it on the dependence vectors,

$$\begin{pmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} 0 & 2 & 2 \\ 0 & 0 & 0 \\ 2 & 2 & 0 \\ -1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 2 & 2 & 0 \\ 1 & 2 & 1 \end{pmatrix}$$

the components of the resulting dependence vectors are all non-negative. Consider the last two components of the resulting dependence vectors. They are the two-dimensional vectors $(2,1)^t$, $(2,2)^t$ and $(0,1)^t$. The dimension of the subspace spanned by them is two and their components are all non-negative. So, they meet the conditions of Lemma 6.2, and the pic-basis of the resulting dependence vectors is $\{(0,0,2,0)^t, (0,0,0,1)^t\}$.

We can take the pic-basis as the set of synchronization vectors, which is deadlock free and optimal. While [8] constructs synchronization vectors for doubly nested loops only, our method is general. The synchronization vectors constructed by [8] are of the form $(1, c)$ and $(0, 1)$, where $c$ may be negative. The first non-zero components of our pic-basis vectors are related to the GCDs of the dependence vectors; they are positive and not smaller than 1. (The difference between these two methods is similar to that between a serial loop and a loop blocked by the GCD method [5].) Our method obtains more implicit parallelism.

7 Concluding Remarks

Positive coordinate basis is a new concept proposed by the authors [4] to attack problems in loop restructuring. A basis is the most appropriate tool for describing the behavior of a set of vectors. Using the pc-basis to derive proper loop transformations is very efficient. In this paper, we have presented the constructions of two unimodular transformations, one for extracting outer loop parallelism and the other for making the non-zero components of the dependence vectors positive. We modify the pc-basis into a pic-basis, which is suitable for minimizing the number of synchronization vectors. More applications of the pc-basis are expected to appear.

References

[1] U. Banerjee, "Unimodular Transformations of Double Loops," Proc. Third Workshop on Programming Languages and Compilers for Parallel Computing, 1990.

[2] D. K. Chen and P. C. Yew, "An Empirical Study on DOACROSS Loops," Proc. ACM International Conference on Supercomputing, pp. 620-632, 1991.

[3] K. G. Kumar, D. Kulkarni and A. Basu, "Generalized Unimodular Loop Transformations for Distributed Memory Multiprocessors," Proc. 1991 International Conf. on Parallel Processing, Vol. 2, pp. 146-149, 1991.

[4] L. Liu and F. C. Lin, "Two-Level Loop Partitioning Based on the Concept of Positive Coordinate Basis," to be presented at the International Workshop on Massively Parallelism: Hardware, Software and Applications, Capri, Italy, Oct. 1994.

[5] C. D. Polychronopoulos, Parallel Programming and Compilers, Kluwer Academic Publishers, Boston, 1988.

[6] C. D. Polychronopoulos, M. B. Girkar, M. R. Haghighat, C. L. Lee, B. Leung and D. A. Schouten, "Parafrase-2: An Environment for Parallelizing, Partitioning, Scheduling Programs on Multiprocessors," Proc. 1989 International Conf. on Parallel Processing, Vol. 2, pp. 39-48, 1989.

[7] A. Schrijver, Theory of Linear and Integer Programming, John Wiley and Sons, 1986.

[8] T. H. Tzen and L. M. Ni, "Data Dependence Analysis and Uniformization for Doubly Nested Loops," Proc. 1992 International Conf. on Parallel Processing, Vol. 2, 1992.

[9] D. Whitfield and M. L. Soffa, "Investigating Properties of Code Transformations," Proc. 1993 International Conf. on Parallel Processing, Vol. 2, pp. 156-160, 1993.

[10] M. E. Wolf and M. S. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, pp. 452-471, Oct. 1991.

[11] M. J. Wolfe, Optimizing Supercompilers for Supercomputers, The MIT Press, 1989.

[12] H. Zima, Supercompilers for Parallel and Vector Computers, ACM Press, 1990.
