Adsmith: an efficient object-based distributed shared memory system on PVM


Wen-Yew Liang
Dept. Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan

Chun-Ta King
Dept. Computer Science
National Tsing Hua University
Hsinchu, Taiwan

Feipei Lai
Dept. Electrical Engineering & Dept. Computer Science and Information Engineering
National Taiwan University
Taipei, Taiwan

Abstract

In this paper, we describe an object-based distributed shared memory called Adsmith. In an object-based DSM, the shared memory consists of many shared objects, through which the shared memory is accessed. Adsmith is built on top of PVM at the library layer using C++. PVM is used as the communication subsystem because it is a de facto standard and encapsulates many system-related details. Several mechanisms are used to improve the performance of Adsmith, such as release memory consistency, load/store-like memory accesses, nonblocking accesses, and atomic operations. Performance results show that even though Adsmith is implemented on top of PVM, programs running on Adsmith can achieve a performance comparable with those running directly on PVM.

1 Introduction

For ease of construction and high scalability, many high-performance parallel computers today are built as distributed-memory systems. In such systems, message passing is the most general programming paradigm. With message passing, programmers are forced to manage the data flows explicitly: they have to know where a piece of data is located and when to set up the send/receive pair between two communicating entities. Such a task is tedious and error-prone. Shared-memory programming, on the other hand, relieves programmers from managing shared data explicitly, so programs can be developed more easily. Combining a distributed-memory architecture with shared-memory programming is thus an attractive choice for parallel computers. Based on this observation, distributed shared memory (DSM) was proposed and has attracted much attention [8].

A DSM provides a logically shared memory on top of a network of computers with physically distributed memory. Previous approaches to DSMs usually partition the shared-memory address space into logical fixed-size pages, which are distributed to the nodes in the system. Through a memory manager, the nodes can access any page in the shared address space. Such a DSM system is referred to as a block-based DSM in this paper. Block-based DSMs are often taken as an extension of traditional virtual memory systems, and thus are usually implemented at the hardware and/or operating system layers. One advantage of such an implementation is transparency: the memory system is totally hidden from the users. However, block-based DSMs have the problem of choosing the right size for the blocks. The block size depends not only on the system characteristics but also on the applications. Another problem is the design complexity. Implementation at the hardware and/or operating system layers requires modifying existing systems, which takes a tremendous effort. Also, these implementations are usually system dependent. Thus it is very difficult to implement block-based DSMs on top of heterogeneous systems.

Another approach to DSM is object-based, which partitions the shared memory according to logical data structures. Object-based DSMs are most often implemented at the language/compiler or library layers. DSMs implemented at these layers usually require some modifications at the users' end, either to the programming model, language, or style. Thus, the transition from sequential computers to such an environment is not as seamless and transparent. Since object-based DSMs are implemented at a rather high layer in a computer system, performance is not as good as that of block-based DSMs. However, object-based DSMs do offer some unique and important features.

An implementation at higher layers makes object-based DSMs very flexible and system independent. Programmers have more control over how data are distributed according to the characteristics of the application. Also, programmers can easily determine important parameters in object-based DSMs, such as the block size, access method, communication mechanism, and memory consistency model. The DSMs can be directly implemented on top of existing generic communication subsystems, such as TCP/IP, PVM, or DCE. Not only can system development efforts be reduced dramatically, but porting to different architectures is also straightforward. Furthermore, the resulting system can be easily modified and improved when newer techniques are available.

Object-based DSMs can be implemented at the language/compiler layer or the library layer. For the former, new compilers or preprocessors must be developed. Techniques involved in automatic data partition and distribution, parallelism extraction, and communication optimization are still immature, and the development efforts are enormous. For library-layer implementations, we only have to support primitives that are useful for programmers and compilers, such as those for distributing data, for utilizing an efficient memory consistency model, and for performance tuning. Thus, implementing the DSM at the library layer is more feasible.

Another issue in implementing object-based DSMs is choosing a suitable communication subsystem. An ideal communication subsystem must be general enough and have a well-defined communication interface in which system details are encapsulated. PVM is one of the best choices. PVM stands for Parallel Virtual Machine [10]. It enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource [3]. PVM provides process management, message buffering, and other useful message-passing utilities. PVM has been ported to many systems and is a de facto standard in high-performance computing. A DSM built on top of PVM can support both message-passing and shared-memory programming. Thus for accesses with known patterns, message passing can be used directly to minimize the number of messages, while for unknown patterns, shared-memory accesses can be used [1].

We have designed and implemented an efficient object-based DSM called Adsmith. It is implemented at the library layer on top of PVM. Programmers use the system through C++. Although Adsmith can be easily ported to different architectures through PVM, it currently does not support heterogeneous environments, for efficiency reasons. Adsmith supports many options that allow the users to specify the properties of each declared object, data access methods, data distribution, memory consistency policies, and communication mechanisms. Although Adsmith is built on top of PVM, carefully designed applications can still achieve good performance on Adsmith. Later we will compare applications running on Adsmith and on bare PVM.

In this paper, we investigate the issues involved in the implementation of Adsmith and describe its user interface. The rest of the paper is organized as follows. In Section 2, we introduce the design strategies of Adsmith. In Section 3, the user interface and the programming style of Adsmith are described. In Section 4, we show some preliminary performance results of Adsmith. We conclude this paper in Section 5.

2 Implementation Strategies

Since Adsmith is totally independent of the underlying operating system, its performance may suffer. Reducing the number of messages and the communication latency is therefore very important. Several methods are used in Adsmith to address this problem, including the use of the Release Consistency memory model [4], load/store-like data accesses (Section 2.2), nonblocking accesses (Sections 2.4 and 3.4), and atomic accesses (Section 3.7). We describe these techniques in more detail below.

2.1 Communication Subsystem

Generally, active messages [2] are used in DSMs to eliminate the need for message buffering and to reduce the access latency. Unfortunately, Adsmith cannot use active messages, because PVM does not provide the needed support. Note that software active messages usually need the help of system-dependent functions. Since the communication details are encapsulated by PVM, using any system-dependent functions directly in Adsmith could be very dangerous. Besides, PVM uses nonblocking sends, so message buffers in user space always exist. As a result, Adsmith does not employ active messages but uses nonblocking sends instead. One advantage of nonblocking sends is that computations and communications can overlap, which is important for high-performance parallel computing.

2.2 Data Granularity

Data granularity is the unit size of the data during internal data allocation and external transmission. In object-based DSMs, users have the freedom of specifying the data granularity. As a result, the false-sharing problem can be avoided.

A problem with object-based DSMs is that the shared objects tend to be small. Since each reference to a shared object may cause a read or write request, many messages may be generated. Since memory references exhibit locality, we solve this problem with a load/store-like memory access style. A load operation is performed only for the first read access, and a store operation only for the last write access. Other accesses to the shared object in the program segment can be performed locally through a cached copy. This is similar to the data access methods in a load/store architecture. Load/store operations in Adsmith will be described in Section 3.
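
The saving from the load/store-like style can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not Adsmith code: `HomeCopy`, `load`, `store` and `add_n_times` are hypothetical names, and a "message" is simply a counter incremented once per simulated request to the home node.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the home copy of one shared object.
// It only counts the requests a real DSM would have to send.
struct HomeCopy {
    std::vector<char> bytes;
    int messages = 0;
    explicit HomeCopy(std::size_t size) : bytes(size, 0) {}
};

// Load: copy the home contents into the local buffer (one request).
inline void load(HomeCopy& home, void* local) {
    ++home.messages;
    std::memcpy(local, home.bytes.data(), home.bytes.size());
}

// Store: copy the local buffer back to the home copy (one request).
inline void store(HomeCopy& home, const void* local) {
    ++home.messages;
    std::memcpy(home.bytes.data(), local, home.bytes.size());
}

// Increment a shared int n times: one load before the first read,
// one store after the last write, everything in between is local.
inline int add_n_times(HomeCopy& home, int n) {
    assert(home.bytes.size() == sizeof(int));
    int local = 0;
    load(home, &local);
    for (int i = 0; i < n; ++i) ++local;   // purely local accesses
    store(home, &local);
    return local;
}
```

In this model, `add_n_times(home, 1000)` performs 1000 local increments but only two simulated requests, whereas a request per access would need on the order of 2000.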

2.3 Data Distribution

Under object-based DSM, users have more control over how shared data are distributed, based on the application behavior and machine characteristics. In our implementation, the home location of a shared object is randomly selected by default. The programmer or parallelizing compiler can also determine how the shared objects are distributed.

Allowing the home nodes to move from time to time is not practical in distributed environments, because many messages will be produced by each change of home node. Adsmith fixes home nodes to simplify the implementation. The problem with such a scheme is that the home node may not be the one that accesses the object the most. Programmers can help to solve this problem by setting the home node to the host that has the most references to the object. If the process with the most references changes at different execution phases, the programmer/compiler can also force the home to be changed. The ability to manually change the home nodes is now under development in Adsmith.
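
As a minimal sketch of the "most references" rule, the helper below picks a home host from a per-host reference profile. The function name `pick_home` and the profile vector are assumptions made for illustration; the paper does not say how such counts would be gathered or passed to Adsmith.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the host with the most recorded references.
// Ties go to the lowest host index. `references_per_host` is a
// hypothetical profile supplied by the programmer or a compiler.
inline int pick_home(const std::vector<long>& references_per_host) {
    assert(!references_per_host.empty());
    std::size_t best = 0;
    for (std::size_t h = 1; h < references_per_host.size(); ++h) {
        if (references_per_host[h] > references_per_host[best]) best = h;
    }
    return static_cast<int>(best);
}
```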

2.4 Write Policy and Coherence Protocol

Two general write policies are write-through and write-back. In Adsmith, both write policies are supported. Since load/store-like memory accesses are used, the last write accesses are always delimited by the programmer. Thus write accesses need not actually be performed to the home node until the last write access is encountered. The last write access can use a write-through policy, while the others use a write-back policy.

Pipeline write means that several write requests can be outstanding at the same time. Pipeline write will have no benefit if the write function is invoked for every write access, because this will produce a large number of messages [7]. Since most accesses in Adsmith are done locally, pipeline write can be used to overlap the communication with the computation.

There are two general methods for data coherence: write-invalidate and write-update. Write-update will update all the copies of the written data, while write-invalidate will only invalidate the copies. Write-update needs to include the content of the modified data in its coherence message. Since Adsmith is an object-based DSM, most objects are small, so a write-update and a write-invalidate message will have similar communication costs. To allow maximum flexibility, Adsmith supports both.
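
The difference between the two coherence methods can be seen in a toy single-address-space model. The sketch below is an assumption-laden illustration (the class `SharedInt` and its counters are hypothetical, and no real messages are sent): a write either refreshes the other valid copies or marks them invalid so that the next read must fetch from the home value.

```cpp
#include <cassert>
#include <vector>

struct CachedCopy {
    int  value = 0;
    bool valid = false;
};

// One shared integer with a home value and one cached copy per process.
class SharedInt {
public:
    explicit SharedInt(int processes) : copies_(processes) {}

    // Read through the cache; an invalid copy is refreshed from home.
    int read(int p) {
        CachedCopy& c = copy(p);
        if (!c.valid) {
            c.value = home_value_;
            c.valid = true;
            ++refresh_messages_;
        }
        return c.value;
    }

    // Write by `writer`; other valid copies are updated or invalidated.
    void write(int writer, int v, bool use_update) {
        copy(writer);  // range check only
        home_value_ = v;
        for (int p = 0; p < static_cast<int>(copies_.size()); ++p) {
            CachedCopy& c = copies_[p];
            if (p == writer) {
                c.value = v;
                c.valid = true;
            } else if (c.valid) {
                ++coherence_messages_;            // one message per remote copy
                if (use_update) c.value = v;      // write-update carries the data
                else            c.valid = false;  // write-invalidate does not
            }
        }
    }

    bool is_valid(int p) { return copy(p).valid; }
    int coherence_messages() const { return coherence_messages_; }
    int refresh_messages() const { return refresh_messages_; }

private:
    CachedCopy& copy(int p) {
        assert(p >= 0 && p < static_cast<int>(copies_.size()));
        return copies_[p];
    }
    int home_value_ = 0;
    int coherence_messages_ = 0;
    int refresh_messages_ = 0;
    std::vector<CachedCopy> copies_;
};
```

Both policies send one coherence message per remote copy in this model; write-invalidate additionally costs a later refresh if the invalidated copy is read again.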

2.5 Memory Model

Release consistency (RC) is implemented in Adsmith. Shared accesses in RC are classified as competing accesses (special accesses) and noncompeting accesses (ordinary accesses). Competing accesses mean that two or more accesses may refer to the same shared memory location at the same time and at least one is a write access. Special accesses are further categorized as synchronization accesses and nonsynchronization accesses. Nonsynchronization accesses are competing accesses which are not used for synchronization purposes. Synchronization accesses are further divided into acquire accesses and release accesses. Adsmith provides all these access operations. Programmers are responsible for writing properly-labeled programs by utilizing these operations [4].

2.6 Architecture of Adsmith

Adsmith is built completely on top of PVM. A daemon is spawned on each host to support run-time shared object handling. Internal manipulations of shared objects are totally transparent to the users. The basic organization of Adsmith is shown in Figure 1.

The architecture can be divided into two layers: the Logical Shared Memory Layer (LSML) and the Process Buffer Layer (PBL). LSML is supported by the daemons. Each daemon interacts with the application processes to provide shared-memory services. Shared objects are distributed to the memories of the participating hosts. PBL exists in each application process. Shared data are buffered in PBL and refreshed from and flushed to LSML when necessary. Note that there is no limitation on the buffer size; the whole local memory can be used as buffers. Status information of shared objects is distributed in the data mapping directories on each daemon and the referencing processes.

3 Programming on Adsmith

Adsmith is implemented as a user-level library in C++ with PVM as its communication platform. It can be viewed as adding a DSM layer on top of PVM. Both the PVM message-passing library and the Adsmith shared-memory library are accessible at the same time. In this section, we introduce the main functions provided in Adsmith.

3.1 System and Process Control

Programmers need not do any initialization or termination explicitly. All this work is accomplished automatically by the object initialization facilities provided by C++. The library contains a system object, which is responsible for system initialization and termination.

For process creation, although PVM provides the function pvm_spawn(), we require that child processes be created through adsm_spawn() from Adsmith, because some system information is transmitted at process creation time.

3.2 Shared Object Allocation and Deallocation

Shared objects can only be allocated at run time. Two forms of the allocation function are supported:

    void *adsm_malloc( char *identifier, int size,
        int hint = AdsmDataDefault );
    void *adsm_malloc( char *identifier, int size,
        void *init, int hint = AdsmDataDefault );

In the declaration, size is the size of the shared object and identifier is the string name used to refer to the shared object. All shared objects must be allocated before they are used. The parameter init in the second form is used to set the initial value for that shared object.

Several options can be set through the hint parameter to affect the access behavior of a shared object. Currently, the value of hint may be AdsmDataCache, AdsmDataLocal and AdsmDataUpdate. AdsmDataCache means that the shared data will be cached in application processes and managed through the coherence protocol. AdsmDataLocal means that the shared object will be allocated on the local host. This can be used when most accesses to the object are performed by the local processes. AdsmDataUpdate means that write-update will be used as the coherence protocol for the declared object. By default, write-invalidate is used if AdsmDataCache is selected. All these values can be set simultaneously by combining them with the OR operation in C++.
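
How OR-combined hints might be decoded can be sketched as follows. The paper names the options but not their encoding, so the bit values, the `HintBit` enumerators and the helper functions here are all hypothetical; treating an object without the cache bit as "uncached" is likewise an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical bit encoding standing in for AdsmDataDefault,
// AdsmDataCache, AdsmDataLocal and AdsmDataUpdate.
enum HintBit {
    HintDefault = 0,
    HintCache   = 1 << 0,
    HintLocal   = 1 << 1,
    HintUpdate  = 1 << 2
};

inline bool has_hint(int hint, HintBit bit) {
    assert(hint >= 0);
    return (hint & bit) != 0;
}

// Coherence protocol implied by a hint word under the rules in the text:
// write-invalidate is the default once caching is selected.
inline std::string coherence_protocol(int hint) {
    if (!has_hint(hint, HintCache)) return "uncached";
    return has_hint(hint, HintUpdate) ? "write-update" : "write-invalidate";
}
```

For instance, `HintCache | HintLocal` would describe an object that is homed on the local host and cached with write-invalidate in this encoding.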

After all references to a shared object are done, its buffer space can be freed by the following function:

    adsm_free( void *ptr );

Freed objects can be reused by reallocating them again.

3.3 Shared Array Declaration

Arrays are often used in scientific computations, and the distribution method significantly affects the execution efficiency. Adsmith allows the programmer to specify the distribution method of an array. The array allocation function has two forms:

    void adsm_malloc_array( char *identifier,
        int elmt_size, int num, void *array,
        int hint = AdsmDataDefault );
    void adsm_malloc_array( char *identifier,
        int elmt_size, int num, void *array,
        int *dist, int hint = AdsmDataDefault );

We explain this function by the following example. Assume that we want to declare a two-dimensional array of integers according to a certain distribution method. The code will look like this:

    int *C[N][N];      // pointers to allocated elements
    int distC[N][N];   /* distribution array, which contains
                          the home node for each element */
    adsm_malloc_array( "arrayC", sizeof(int),
        N*N, C, distC );   // allocate
    ... refer to each element by *C[i][j] ...

The identifier “arrayC” is the name of the whole array. Each element can be referenced by “arrayC[i]”, where i means the i-th element. The home node to which each element is distributed is specified in distC[N][N]. After allocation, pointers to the N*N shared objects are stored in array C. Further references to the shared array can then be made through the returned pointers, i.e., *C[i][j]. Usually, the distribution is determined by the parallelizing compiler or the programmer.
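
One way a programmer or compiler might fill such a distribution array is a row-block layout. The sketch below is illustrative only: `row_block_distribution` is a hypothetical helper, and it assumes that home nodes are identified by host indices 0 to hosts-1, which the paper does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a row-major n x n distribution in which consecutive rows are
// assigned to the same host: rows 0..k-1 to host 0, the next k rows to
// host 1, and so on, with k = ceil(n / hosts).
inline std::vector<int> row_block_distribution(int n, int hosts) {
    assert(n > 0 && hosts > 0);
    const int rows_per_host = (n + hosts - 1) / hosts;
    std::vector<int> dist(static_cast<std::size_t>(n) * n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            dist[static_cast<std::size_t>(i) * n + j] = i / rows_per_host;
        }
    }
    return dist;
}
```

A row-block layout suits computations in which each process mostly touches its own band of rows; other access patterns would call for a different mapping.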

3.4 Ordinary Accesses

After a shared object has been allocated, an address is returned, which points to the buffer space of the shared object. As described previously, Adsmith uses a load/store-like memory access style. Thus most shared object accesses are done on the local buffer. Actual accesses to the shared memory must be performed through the following two operations. They refresh the buffers from and flush their contents to the shared memory when necessary.¹

    adsm_refresh( void *ptr );
    adsm_flush( void *ptr );

Since no hardware or operating system related facilities are used in Adsmith, data should be manually refreshed (loaded) from LSML by the programmer before they are accessed. Similarly, if data are modified, they should be flushed (stored) back to LSML after they are referenced. Under RC, adsm_refresh() is for ordinary loads, and adsm_flush() is for ordinary stores. The value refreshed is guaranteed to be as up-to-date as that at the time of the last acquire (see the next section).

For more efficient data accesses, Adsmith also supports nonblocking load, i.e., data prefetching, through the following function:

    adsm_prefresh( void *ptr );

The function is very similar to adsm_refresh(), and the programmer can insert the prefetch call as far ahead of the first load access as possible. The sequence of shared data accesses in Adsmith is depicted as follows:

    Acquire -> Prefresh -> Refresh -> Local Accesses -> Flush -> Release

where Refresh and Flush are the ordinary accesses discussed above. Nonblocking load and store will be performed between the prefresh-refresh and flush-release pairs, respectively. The code segment below is a typical example of performing computations on a shared object within a critical section:

¹ If AdsmDataCache is specified, adsm_refresh() may be performed locally without the need of any communication.

    typeA *A = (typeA*) adsm_malloc( "A", sizeof(typeA) );
    AdsmMutex mutex( "mutex name" );   // AdsmMutex is a synchronization class
    mutex.lock();
    adsm_prefresh( A );
    ... prologue of computation ...
    adsm_refresh( A );
    ... computations with local access on A ...
    adsm_flush( A );
    ... epilogue of computation ...
    mutex.unlock();

Note that adsm_refresh() is still required before the first load access to ensure that the requested data has arrived.
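
The access sequence above lends itself to a scope-based wrapper in C++. The following is a sketch under stated assumptions rather than part of Adsmith: `TraceBackend` merely records the order of operations instead of calling the library, and `ScopedAccess` and `traced_critical_section` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of operations instead of talking to a real DSM.
struct TraceBackend {
    std::vector<std::string> log;
    void acquire()  { log.push_back("acquire"); }
    void prefresh() { log.push_back("prefresh"); }
    void refresh()  { log.push_back("refresh"); }
    void flush()    { log.push_back("flush"); }
    void release()  { log.push_back("release"); }
};

// Acquire and prefetch on entry, flush and release on scope exit.
template <class Backend>
class ScopedAccess {
public:
    explicit ScopedAccess(Backend& backend) : backend_(backend) {
        backend_.acquire();
        backend_.prefresh();   // nonblocking load starts here
    }
    // Blocks until the prefetched data is usable (the Refresh step).
    void wait_for_data() { backend_.refresh(); }
    ~ScopedAccess() {
        backend_.flush();
        backend_.release();
    }
    ScopedAccess(const ScopedAccess&) = delete;
    ScopedAccess& operator=(const ScopedAccess&) = delete;

private:
    Backend& backend_;
};

// Runs one critical section and returns the recorded operation order.
inline std::vector<std::string> traced_critical_section() {
    TraceBackend backend;
    {
        ScopedAccess<TraceBackend> access(backend);
        // ... prologue of computation would overlap with the prefetch ...
        access.wait_for_data();
        backend.log.push_back("local accesses");
    }
    assert(backend.log.back() == "release");
    return backend.log;
}
```

Tying flush and release to the destructor means they cannot be forgotten on an early return, at the cost of hiding where the communication happens.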

3.5 Synchronization Accesses

Ordinary accesses require that the programmer use enough synchronization to ensure correctness. Adsmith provides three classes of synchronization operations: counting semaphore, mutex and barrier. The public methods are listed as follows:

    AdsmSemaphore::wait();
    AdsmSemaphore::signal();
    AdsmMutex::lock();
    AdsmMutex::unlock();
    AdsmBarrier::barrier( int count );

Among the synchronization functions, semaphore wait, mutex lock and barrier are acquire accesses, and semaphore signal, mutex unlock and barrier are release accesses. An acquire is needed in order to gain the access right to a set of data, and a release is used to grant the access right. The RC model guarantees that ordinary accesses after an acquire will obtain the most up-to-date data available at the time of the acquire.

3.6 Nonsynchronization Accesses

Adsmith has two nonsynchronization functions:

    adsm_refresh_now( void *ptr );
    adsm_flush_now( void *ptr );

These two accesses are performed without waiting for previous ordinary accesses. That is, a write through adsm_flush_now() will be seen immediately by all the following loads through adsm_refresh_now(), even when they are invoked by other processes.

3.7 Atomic Accesses

Consider accessing a shared object in a critical section. The number of messages required may be as many as seven, including two for acquire, two for refresh, two for flush, and one for release. This is expensive when there is only one object in the critical section.

The problem can be solved by placing the synchronization arbitrator at the home node of the shared object and combining the synchronization and data access operations. During an acquire, the requested data can be piggybacked on the lock grant message. After the computations, the modified data can also be sent back with the release message. In this way, the required messages will be reduced to four at most (two for acquire and refresh, and two for flush and release). Since most of the shared objects are small, carrying the data contents directly in the acquire/release messages should not affect the performance.
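
The message arithmetic above can be written down directly. This is a small sketch using only the upper bounds quoted in the text; the struct `MessageCounts` and the function `messages_saved` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Upper bounds per critical section that touches one shared object.
struct MessageCounts {
    int acquire = 2;
    int refresh = 2;
    int flush   = 2;
    int release = 1;

    // Separate synchronization and data messages: 2 + 2 + 2 + 1.
    int separate() const { return acquire + refresh + flush + release; }

    // Data piggybacked on the lock grant and on the release:
    // two messages for acquire+refresh, two for flush+release.
    int combined() const { return 2 + 2; }
};

// Messages saved (at most) over a number of such critical sections.
inline int messages_saved(int critical_sections) {
    assert(critical_sections >= 0);
    const MessageCounts m;
    return critical_sections * (m.separate() - m.combined());
}
```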

Adsmith provides atomic accesses to support this kind of access. Two functions are supported:

    adsm_atomic_begin( void *ptr, int type = AdsmAtomicWrite );
    adsm_atomic_end( void *ptr );

The function adsm_atomic_begin() can be viewed as a combination of acquire and refresh, while the function adsm_atomic_end() can be viewed as a combination of flush and release. Note that in Adsmith these operations are categorized as nonsynchronization accesses. They cannot be used as synchronization accesses, because coherence of other shared objects is not maintained here. Here is an example of atomic accesses modified from that in Section 3.4.

    typeA *A = (typeA*) adsm_malloc( "A", sizeof(typeA) );
    adsm_atomic_begin( A );
    ... computation with local access on A ...
    adsm_atomic_end( A );

Let the program segment between adsm_atomic_begin() and adsm_atomic_end() be called an atomic section. Two types of atomic operations can be specified in the type parameter of adsm_atomic_begin(): AdsmAtomicWrite and AdsmAtomicRead. The former indicates that both read and write accesses are included in the atomic section; the latter indicates that only read accesses exist in the section. Adsmith implements a single-writer/multiple-readers protocol. For a writer, adsm_atomic_begin() can be performed only when there are no readers or writers in the atomic section. For a reader, adsm_atomic_begin() can be performed only when there is no writer in the atomic section. For fairness, Adsmith implements the writer-first protocol. That is, when a writer is waiting to enter the atomic section, readers which come after the writer will be blocked until the writer has finished its atomic section. Of course, readers before the writer can proceed until they all exit their atomic sections.
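
The admission rules just described can be modelled as a message-driven arbiter such as a home node might run. The sketch below is one possible reading, not Adsmith's implementation: `AtomicArbiter` is a hypothetical class that keeps waiting requests in a FIFO queue, so a reader arriving behind a waiting writer is held back until that writer has entered and left.

```cpp
#include <cassert>
#include <deque>
#include <vector>

class AtomicArbiter {
private:
    struct Request { int process; bool is_writer; };

public:
    // Returns true if the requester may enter at once; otherwise the
    // request is queued and granted later by leave().
    bool request(int process, bool is_writer) {
        const bool can_enter = queue_.empty() && !writer_active_ &&
                               (!is_writer || readers_ == 0);
        if (can_enter) {
            enter(is_writer);
            return true;
        }
        queue_.push_back(Request{process, is_writer});
        return false;
    }

    // Called when a process leaves its atomic section. Returns the
    // processes that are granted entry as a consequence, in order.
    std::vector<int> leave(bool was_writer) {
        if (was_writer) {
            assert(writer_active_);
            writer_active_ = false;
        } else {
            assert(readers_ > 0);
            --readers_;
        }
        std::vector<int> granted;
        while (!queue_.empty() && !writer_active_) {
            const Request next = queue_.front();
            if (next.is_writer && readers_ > 0) break;  // writer waits for readers
            queue_.pop_front();
            enter(next.is_writer);
            granted.push_back(next.process);
        }
        return granted;
    }

private:
    void enter(bool is_writer) {
        if (is_writer) writer_active_ = true;
        else ++readers_;
    }
    std::deque<Request> queue_;
    int  readers_ = 0;
    bool writer_active_ = false;
};
```

With reader 1 inside, a request from writer 2 is queued, and a later request from reader 3 is queued behind it; when reader 1 leaves, only writer 2 is granted, and reader 3 follows once the writer leaves.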

3.8 Pointer

Pointers in shared memory are supported in Adsmith, but the usage is not so straightforward. This is because the address of a shared object in one process may not be the same as that in another process. Thus, programmers are required to translate the local address of a shared object to a globally recognizable address before the address is passed to other processes through the shared memory. Functions for pointer manipulations are as follows:

    int adsm_gid( void *ptr );

The function adsm_gid() translates the local address of a shared object into its global address, which is represented by an integer. The function adsm_attach(), on the other hand, translates a global address back to the local address for the requesting process.

For example, if the programmer wants to pass the address of shared object T from process P1 to process P2 through the shared object (pointer) S, the following two code segments can be used.

    // process P1 sets the pointer
    AdsmBarrier Bpointer( "barrier for this code" );
    sometype *T = (sometype*) adsm_malloc(
        "target data", sizeof(sometype) );
    int *S = (int*) adsm_malloc( "pointer to T",
        sizeof(int) );
    *S = adsm_gid( T );       // get global address of T
    adsm_flush_now( S );      // flush immediately
    Bpointer.barrier( 2 );    // done

    // process P2 gets the pointer
    AdsmBarrier Bpointer( "barrier for this code" );
    int *S = (int*) adsm_malloc( "pointer to T",
        sizeof(int) );
    Bpointer.barrier( 2 );    // wait until P1 is done
    adsm_refresh_now( S );    // get the pointer value
    // attach the pointer into local address space
    sometype *T = (sometype*) adsm_attach( *S );
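
The translation between local addresses and global identifiers can be pictured with a per-process table. The sketch below is a toy model under stated assumptions: `ProcessView` and its methods are hypothetical, both "processes" live in one address space, and global ids are chosen by the caller rather than by a DSM runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// One table per process: local buffer addresses differ between
// processes, while the integer global id of an object is the same.
class ProcessView {
public:
    // Creates (or finds) the local buffer for the object with id `gid`.
    void* map_object(int gid, std::size_t size) {
        assert(size > 0);
        std::vector<char>& buffer = buffers_[gid];
        if (buffer.empty()) buffer.resize(size);
        return buffer.data();
    }

    // Local address -> global id; returns -1 for an unknown address.
    int to_gid(const void* local) const {
        for (const auto& entry : buffers_) {
            if (entry.second.data() == local) return entry.first;
        }
        return -1;
    }

    // Global id -> local address; returns nullptr for an unknown id.
    void* attach(int gid) {
        auto it = buffers_.find(gid);
        if (it == buffers_.end()) return nullptr;
        return it->second.data();
    }

private:
    std::map<int, std::vector<char>> buffers_;
};
```

Only the integer id is meaningful across processes, which is why the example above stores the result of the id lookup, not the raw pointer, in the shared object S.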

4 Performance Evaluation

In this section we study the performance of Adsmith through an application program that solves the Traveling Salesman Problem. We compare the performance of the application programs developed in Adsmith and in PVM. The PVM version was written following the master-slave programming model. We ported it onto Adsmith using the SPMD (Single Program Multiple Data) model with the algorithm unchanged. There are two major communication parts in the program, which are used to compute a global maximum (minimum) from local maximums (minimums). The related code segments are listed below.

PVM version:

Master Program

    int slaves[SLAVE_NUM];   // slave tids
    // get local maxs and compute the global max
    float max = 0.0, local_max;
    for (int i = 0; i < SLAVE_NUM; i++) {
        pvm_recv( -1, SOME_TAG );
        pvm_upkfloat( &local_max, 1, 1 );
        if (local_max > max) max = local_max;
    }
    // broadcast the global max
    pvm_initsend( PvmDataDefault );
    pvm_pkfloat( &max, 1, 1 );
    pvm_mcast( slaves, SLAVE_NUM, SOME_TAG );

Slave Program

    int master;   // master tid
    // compute local max
    float local_max = ...;
    // send local max to master
    pvm_initsend( PvmDataDefault );
    pvm_pkfloat( &local_max, 1, 1 );
    pvm_send( master, SOME_TAG );
    // get global max from master
    float max;
    pvm_recv( master, SOME_TAG );
    pvm_upkfloat( &max, 1, 1 );

Adsmith version:

    // compute local max
    float local_max = ...;
    float *max = (float*) adsm_malloc(
        "max", sizeof(float) );
    // compute max thru atomic operation
    adsm_atomic_begin( max );
    if (local_max > *max) *max = local_max;
    adsm_atomic_end( max );
    // wait for all processes done
    AdsmBarrier Btsp( "tsp" );
    Btsp.barrier( PROC_NUM );
    // get the global max
    adsm_refresh( max );

Atomic accesses are used in the Adsmith version, because there is only one shared object in the critical section. From the code, we can roughly compute the ratio of the number of messages required by the PVM version to that required by the Adsmith version, which is about 1:4. One reason for the larger number of messages in the Adsmith version is that we did not write the Adsmith version from scratch, but only translated it directly from the PVM version.

Two Sparc 2 workstations were used in the experiment. The performance results are shown in Table 1. Surprisingly, we find that as the problem size increases, the execution time of the Adsmith version becomes closer to that of the PVM version. It is even shorter when the number of cities is greater than 10,000. One explanation is that the communications are overlapped between processes in Adsmith. Besides, the implementation overheads are lower in Adsmith.

5 Conclusion

In this paper we have introduced an object-based approach to DSM designs and described how such a system, called Adsmith, is built. Major features of Adsmith are listed below:

1. It is an object-based DSM implemented as a user-level library in C++ on top of PVM.

2. It has a two-level memory hierarchy: process buffer layer and logical shared memory layer.

3. Information about shared objects is distributed in both daemons and application processes. Programmers are allowed to specify the distribution of shared objects/arrays.

4. The home nodes of shared objects are fixed. Programmers can determine whether the shared object is to be cached or not.

5. It employs a load/store-like data access style and the Release Consistency model is fully supported.

6. It provides atomic accesses to minimize the number of messages.

7. Nonblocking store (pipeline write) and nonblocking load (prefetch) are used to overlap communications and computations.

8. Both write-through and write-back are supported. Different shared objects can have different coherence protocols.

Table 1: Execution times for the TSP program (columns: Cities, PVM version, Adsmith version, Speedup)

Since we built our system on top of PVM, the implementation considerations were quite different from those of other systems. For example, using PVM prevents us from adopting the active message mechanism, and a higher communication overhead is involved. Thus, our major task is to reduce the number of messages. Many of the features listed above help to achieve this goal. In addition, much flexibility is provided to help performance tuning, especially for parallelizing compilers. For example, shared objects can be set to use the cache or not, to use write-update or write-invalidate, etc. Home nodes can be assigned by the programmer or the compiler, and so can the distribution of shared arrays. Prefetching is also supported to hide the load access latency. Preliminary experimental results show that Adsmith is efficient and can achieve performance comparable with that of programs running directly on PVM.

References

[1] Tzi-cker Chiueh and Manish Verma, “A Compiler-Directed Distributed Shared Memory System,” 9th ACM International Conference on Supercomputing, 1995.

[2] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser, “Active Messages: a Mechanism for Integrated Communication and Computation,” In Proceedings of the 19th International Symposium on Computer Architecture, May 1992.

[3] A. Geist, et al., PVM 3.0 User’s Guide and Reference Manual, Oak Ridge National Laboratory, 1993.

[4] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, “Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors,” In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 15-26, May 1990.

[5] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach, “CRL: High-Performance All-Software Distributed Shared Memory,” In Proceedings of the Fifteenth Symposium on Operating Systems Principles, December 1995.

[6] Pete Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel, “TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems,” In Proceedings of the 1994 Winter Usenix Conference, pp. 115-131, Jan. 1994.

[7] Wen-Yew Liang, “ADSMITH: A Structure-based Heterogeneous Distributed Shared Memory on PVM,” Master Thesis, National Tsing Hua University, Taiwan, June 1994.

[8] B. Nitzberg and V. Lo, “Distributed Shared Memory: A Survey of Issues and Algorithms,” IEEE Computer, Vol. 24, No. 8, pp. 52-60, Aug. 1991.

[9] Steven K. Reinhardt, James R. Larus, and David A. Wood, “Tempest and Typhoon: User-Level Shared Memory,” In Proceedings of the 21st Annual International Symposium on Computer Architecture, April 1994.

[10] V.S. Sunderam, “PVM: A Framework for Parallel Distributed Computing,” Concurrency: Practice and Experience, Vol. 2, No. 4, Dec. 1990.