Short Paper

_________________________________________________

Incorporating Memory Resource Considerations into

the Workload Distribution of Software DSM Systems

YEN-TSO LIU1, TYNG-YEU LIANG2, JYH-BIAU CHANG3 AND CE-KUEN SHIEH1

1Department of Electrical Engineering

National Cheng Kung University Tainan, 701 Taiwan

2Department of Electrical Engineering

National Kaohsiung University of Applied Sciences Kaohsiung, 807 Taiwan

3Department of Information Management

Leader University Tainan, 709 Taiwan

E-mail: {andy1; lty2; andrew3; shieh1}@hpds.ee.ncku.edu.tw

Conventional workload distribution schemes for software distributed shared memory (DSM) systems simply distribute the program threads in accordance with the CPU power of the individual processors or the data-sharing characteristics of the application. Although these schemes aim to minimize the program execution time by reducing the computation and communication costs, memory access costs also have a major influence on the overall program performance. If a processor has insufficient physical memory space to cache all of the data required by its local working threads, it must perform a series of page replacements if it is to complete its thread executions. Although these page replacements enable the threads to complete their tasks, thread execution is inevitably delayed by the latency of the page swapping operations. Consequently, the current study proposes a novel workload distribution scheme for DSM systems which considers not only the CPU power and data-sharing characteristics, but also the physical memory capabilities of the individual processors. The present results confirm the importance of considering memory resources when establishing an appropriate workload distribution for DSM systems, and indicate that the proposed scheme is more effective than schemes which consider only CPU resources or only memory resources.

Keywords: workload distribution, distributed shared memory (DSM), memory resource,

page replacement, data-sharing characteristic

1. INTRODUCTION

Obtaining a suitable workload distribution is essential if the performance of programs executed in parallel on clusters of computers is to be optimized. Most users generally partition their problems evenly into a number of threads and then distribute these threads equally onto each processor in an attempt to achieve a workload balance. However, the


processors clustered in computer networks are generally not identical in terms of their resource capabilities, e.g. CPU power, physical memory space, network bandwidth, etc. Consequently, some processors may possess insufficient resources to match the demands of their local threads, while others may have insufficient work to fully exploit their available resources. This not only delays the overall finish time of the program, but also reduces the return obtained from the original investment in the computational resources. Therefore, it is necessary to develop a more sophisticated scheme to distribute program workloads onto clustered computers in accordance with their specific resource capabilities.

Recently, software distributed shared memory (DSM) systems have been successfully applied for the parallel scheduling of program workloads onto clustered computers. These run-time systems use software technology to construct a virtual shared memory abstraction over the clustered computers. With this shared memory abstraction, users can employ shared variables rather than message passing to develop their applications. This greatly reduces the complexity of programming applications designed for execution on clustered computers. However, DSM systems suffer the same workload distribution problems as those described in the paragraph above. Although some DSM systems employ dynamic workload distribution schemes to optimize the performance of user applications [1-4], these schemes consider only the CPU power and the data-sharing characteristics of the application when distributing the program threads onto the clustered processors.

Besides computation and communication costs, memory access costs also play a key role in determining the overall program performance. If a processor has insufficient physical memory space to cache all of the data demanded by its local threads, it will be required to perform page replacements at run time whenever its local threads attempt to access data which is not located in its physical memory. Although virtual memory technology enables the processor to complete the tasks of its local threads, the memory swapping routines inevitably postpone the thread execution. The rapid advances made in VLSI technology over the past decade have increased the relative influence of memory swapping delays on the program performance, since the speed differential between the CPU and external storage devices is greater than that between the CPU and its physical memory. Therefore, workload distribution methods which take no account of memory resources are liable to make flawed decisions which degrade rather than enhance the performance of DSM programs.

In order to resolve this problem, this paper discusses the inclusion of memory resource considerations in the workload distribution planning of software DSM systems. In the first step of the present research, the memory resources of the processors are taken into account in deriving a set of formulae with which to predict the execution time of each processor for a specific thread-mapping pattern when executing users' DSM programs. These formulae provide the means to identify the thread-mapping pattern which improves the system performance by simultaneously considering the computational and memory resources of each processor and the data-sharing characteristics of the DSM application. Having identified the optimum thread-mapping pattern, the current thread-mapping pattern is adjusted at run time in order to reduce the delay latencies between processors, hence improving the performance of the DSM application. In the second stage of the current study, the proposed workload distribution scheme is implemented on


a testbed known as Teamster [9] in order to examine its effect on the performance of various test applications.

The remainder of this paper is organized as follows. Section 2 discusses previous studies of workload distribution on DSM systems. Section 3 analyzes the impact of memory resource considerations on the performance of DSM applications and develops analytical formulae to establish the iteration finish times of the executing processors under different thread-mapping patterns. Section 4 discusses the architecture of the proposed workload distribution scheme on Teamster. Section 5 presents the results of a series of experiments designed to investigate the effect of considering memory resources on the application performance when planning the workload distribution of DSM systems. Finally, Section 6 draws some brief conclusions and highlights the areas of intended future research.

2. RELATED WORKS

Current DSM systems capable of supporting workload balancing via dynamic workload redistribution include CVM [5], JIAJIA [6] and Cohesion [7]. CVM focuses on obtaining the workload balance which minimizes the communication costs incurred when maintaining data consistency. CVM achieves this by distributing the program threads onto processors in accordance with the computational powers of the individual processors and the computational demands of the working threads. Furthermore, CVM co-locates pairs of threads having the highest degree of mutual data sharing on the same node in order to minimize inter-node communication costs. JIAJIA assumes that each processor has sufficient physical memory space to hold the data required by the threads assigned to it. Adopting a similar approach to that taken by CVM, JIAJIA distributes the program workload based on the computational powers of the individual processors. The third system, Cohesion, divides the program workload distribution task into two phases, namely the migration phase and the exchange phase. In the migration phase, Cohesion estimates the appropriate workload for each processor using the same approach as that employed by CVM and then migrates threads from the heavily loaded nodes to the more lightly loaded nodes in order to minimize load imbalance costs. Meanwhile, in the exchange phase, pairs of threads with the highest degree of mutual data sharing are co-located on the same node in order to reduce the communication costs incurred by thread exchanges.

The three systems described above all neglect the memory resources of the individual processors in the computer cluster when distributing the application workload. Some researchers [12] have considered both processor and memory factors when developing the workload distribution policy for applications designed for execution on distributed systems. In that study, the system adopts a CPU-memory-based allocation policy when memory resources are insufficient. To optimize application performance, this policy allocates a job to a node only if the node has sufficient memory resources to process the task. However, this approach is not suitable for DSM systems. In the parallel computation of a DSM application, the main task is generally partitioned into several subtasks and dispatched to a number of different nodes. The DSM system then collects the execution results of the individual subtasks to obtain the result of the main task.


Accordingly, the extent of the parallelism between each subtask is of significant interest when developing workload distribution schemes for software DSM systems.

3. FORMULA ANALYSIS

DSM applications can be categorized into three broad groups, namely fork-join, run-to-complete and iterative [8]. Since each group has different execution characteristics, it is necessary to design specific workload distribution schemes for each one. The current workload distribution investigation focuses on iterative DSM applications. Compared to fork-join and run-to-complete applications, iterative applications more readily provide the information required to derive precise estimates of the program execution time for different thread-mapping patterns. Furthermore, iterative applications provide a clearer view of the impact of different workload distributions on the program performance.

In software DSM systems, an iterative program is generally partitioned into a number of threads, which are then distributed onto processors for parallel execution. When individual threads finish their jobs within the current iteration, they must join the other threads at a barrier before commencing the next iteration. Accordingly, the finish time of any iteration is determined by the longest finish time of any processor working within that iteration. Clearly, the total execution time of the iterative program is given by the sum of the finish times of the individual iterations created in that program. Therefore, to evaluate the effect of the workload distribution on the application performance, this study develops a formula to estimate the iteration finish time of a processor with a specific thread-mapping pattern.
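The timing model just described (each iteration ends at a barrier when its slowest processor finishes, and the program time is the sum over iterations) can be sketched as follows; the function and variable names are hypothetical illustrations, not part of the paper's implementation:

```python
# Hypothetical sketch of the barrier-synchronized timing model described above:
# per_iter_times[i][x] = finish time of processor x in iteration i.
def total_execution_time(per_iter_times):
    # Each iteration ends at a barrier, so its finish time is the
    # maximum finish time over all processors in that iteration.
    return sum(max(proc_times) for proc_times in per_iter_times)

# Example: two iterations on three processors.
times = [
    [1.0, 2.5, 1.8],  # iteration 1: processor 2 is slowest
    [2.0, 1.0, 1.5],  # iteration 2: processor 1 is slowest
]
assert total_execution_time(times) == 4.5
```

This makes the optimization target explicit: reducing the finish time of the slowest processor in each iteration, rather than the average load, is what shortens the program.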

Basically, the iteration finish time of processor x, T_x, comprises three components: the computation time, the memory swapping time and the communication time, i.e.,

T_x = T^x_comp + T^x_mem + T^x_comm.

The computation time, T^x_comp, is the time spent by processor x executing the computational work of its local threads. Let S_x be the set of all threads running on processor x and t^i_comp be the computation time of thread i assigned to processor x. T^x_comp is therefore given by

T^x_comp = Σ_{i∈S_x} t^i_comp.

The communication time, T^x_comm, is the time spent by processor x propagating its data page updates to other processors holding the same pages in order to maintain data consistency. According to Liang [11], T^x_comm is given by

T^x_comm = Σ_{y=1, y≠x}^{N} Σ_{k=1}^{P} C_xyk,

where C_xyk = ⌈ φ(∪_{i∈S_x} diff_ik) / Size_packet ⌉ × t_packet if processor x and processor y share page k, and C_xyk = 0 otherwise. In this equation, N is the number of execution processors, P is the total number of data pages, Size_packet is the network packet size, and t_packet is the average time spent transferring one message packet in the network. If processor x shares page k with processor y, processor x will send its updates for page k to processor y to maintain data consistency. The update of processor x for page k is actually the accumulation of its local threads' updates for page k. Consequently, processor x will send the update of page k, i.e.,

∪_{i∈S_x} diff_ik, to processor y if the threads located on processor x have written page k. The number of message packets for sending ∪_{i∈S_x} diff_ik is then given by ⌈ φ(∪_{i∈S_x} diff_ik) / Size_packet ⌉, where φ(∪_{i∈S_x} diff_ik) is the size of ∪_{i∈S_x} diff_ik.

Finally, the memory swapping time, T^x_mem, is the time spent by processor x performing page replacements to cache the data accessed by its local threads. Let M_x represent the maximum memory space which processor x can afford for thread memory demands. Furthermore, let m_i be the memory space requested by thread i. In the case where a processor has sufficient memory space to cache all of the data accessed by its local threads, the latency of memory accesses can be neglected. However, if the processor has insufficient memory space, page faults will occur frequently during thread execution as virtual memory mechanisms are triggered to perform page replacements. The memory accesses of the threads will be delayed due to the latency of executing these page replacements. Consequently, the memory swapping time of processor x can be denoted as:

T^x_mem = Σ_{i∈S_x} t^i_mem,  if Σ_{i∈S_x} m_i > M_x;
T^x_mem = 0,                  if Σ_{i∈S_x} m_i ≤ M_x,

where t^i_mem is the memory swapping latency of thread i assigned to processor x.
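The piecewise definition of T^x_mem translates directly into code. The sketch below is a hypothetical illustration of the formula only (names are not from Teamster), using integer latencies for clarity:

```python
def mem_time(mem_demands, swap_latencies, M_x):
    """T_mem for processor x: if the total memory demand of the local
    threads exceeds the physical memory M_x, every thread pays its
    swapping latency; otherwise the swapping cost is zero."""
    if sum(mem_demands) > M_x:
        return sum(swap_latencies)
    return 0.0

# Three threads demanding 400 MB in total on a 256 MB node -> swapping occurs.
assert mem_time([100, 150, 150], [5, 7, 3], 256) == 15
# The same threads on a 512 MB node -> no swapping cost at all.
assert mem_time([100, 150, 150], [5, 7, 3], 512) == 0.0
```

The discontinuity is the key property: moving a single thread off an overloaded node can eliminate the entire swapping term, which is why memory-blind schemes can be badly wrong.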

The time required to execute a page replacement can be divided into two discrete components. The first component is the time spent scanning the physical memory to identify the least-recently-used (LRU) data pages and then swapping these pages out to disk. The second component relates to the time spent swapping in the data pages required by the threads from disk to the physical memory. Therefore, the memory swapping latency of thread i assigned to processor x can be expressed as

t^i_mem = f_i × (t^x_spi + t^x_spo),

where f_i is the number of page replacements executed by processor x in caching the data pages required by thread i, t^x_spo is the average time spent by processor x searching for the LRU data pages and swapping one page out to the swapping device, and t^x_spi is the average time spent swapping in one page on processor x. Therefore, T^x_mem can be divided into two discrete components, i.e., the memory swapping-in time, T^x_spi, and the memory swapping-out time, T^x_spo. T^x_spi is the total time spent by the system swapping in pages for the DSM application on processor x. T^x_spo is the total time spent by the system searching for the LRU data pages on processor x and then swapping these pages out.
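Putting the three cost terms together, the mapping-selection idea of this section can be sketched as a brute-force search over thread-mapping patterns. This is an illustration under simplified assumptions, with hypothetical names and inputs; the paper's actual scheme adjusts the current mapping at run time rather than enumerating all patterns:

```python
import itertools

def iteration_time(mapping, n_procs, t_comp, m, M, t_mem_lat, comm_cost):
    """Predicted finish time of one iteration: max over processors of
    T_comp + T_mem + T_comm for a given thread -> processor mapping."""
    finish = []
    for x in range(n_procs):
        local = [i for i, p in enumerate(mapping) if p == x]
        T_comp = sum(t_comp[i] for i in local)
        # Swapping cost applies only when memory demand exceeds capacity M[x].
        T_mem = sum(t_mem_lat[i] for i in local) if sum(m[i] for i in local) > M[x] else 0
        T_comm = comm_cost(x, local)
        finish.append(T_comp + T_mem + T_comm)
    return max(finish)

def best_mapping(n_threads, n_procs, **costs):
    # Exhaustive search over all thread-mapping patterns (illustration only).
    return min(itertools.product(range(n_procs), repeat=n_threads),
               key=lambda mp: iteration_time(mp, n_procs, **costs))

# Hypothetical cluster: processor 0 has little memory, processor 1 has plenty.
t_comp = [4, 4, 4, 4]          # per-thread computation time
m = [300, 300, 300, 300]       # per-thread memory demand (MB)
M = [500, 1200]                # physical memory per processor (MB)
t_mem_lat = [10, 10, 10, 10]   # per-thread swapping latency when memory is short
no_comm = lambda x, local: 0   # ignore communication in this toy example
mp = best_mapping(4, 2, t_comp=t_comp, m=m, M=M, t_mem_lat=t_mem_lat, comm_cost=no_comm)
# A memory-blind even split (2+2) would force swapping on processor 0;
# the memory-aware search keeps at most one thread there.
assert sum(1 for p in mp if p == 0) <= 1
```

In this toy case the even split costs 8 + 20 = 28 time units on processor 0, while placing a single thread there bounds the iteration time by processor 1's computation alone, illustrating why the memory term must enter the mapping decision.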

In general, the number of swapping-out operations is equal to the number of swapping-in operations, and the occurrence of these operations varies as a function of the memory deficiency. However, most UNIX-like operating systems use an individual page scanning [13] process to scan the entire physical memory space in order to identify the LRU pages which can be swapped out. The performance of this page scanner is directly proportional to the processor power. Additionally, the probability of identifying LRU pages for page replacement is directly related to the size of the available physical memory which the system can offer the DSM application. In other words, T^x_spo is inversely proportional to both the processor power and the size of the available physical memory.

6. CONCLUSIONS

This paper has examined the effect of incorporating memory resource considerations into the workload distribution of software DSM systems. The influence of memory resources on the DSM program performance has been thoroughly analyzed, and the results of this analysis have been applied to develop a novel workload distribution scheme implemented on a testbed referred to as Teamster. The proposed workload distribution scheme efficiently maintains the workload balance of user DSM applications with comparatively low overheads. The experimental results have confirmed that memory swapping latency costs play a crucial role in determining the overall performance of DSM applications. The proposed workload distribution scheme outperforms conventional methods which consider only the processor power or only the available physical memory when attempting to minimize the execution time. The workload distribution scheme proposed in this paper is intended for clusters of computers having only one processor. However, an increasing number of computers now have more than one processor. Accordingly, in a future study, the current study group intends to develop an advanced workload distribution method for DSM systems clustered with SMP (symmetric multiprocessor) machines.

REFERENCES

1. K. Thitikamol and P. Keleher, “Thread migration and load balancing in non-dedicated environments,” in Proceedings of the 14th International Parallel and Distributed Processing Symposium, 2000, pp. 583-588.

2. A. Dubrovski, R. Friedman, and A. Schuster, “Load balancing in distributed shared memory systems,” International Journal of Applied Software Technology, Vol. 3, 1998, pp. 167-202.

3. C. Lai, C. K. Shieh, J. C. Ueng, Y. T. Kok, and L. Y. Kung, “Load balancing in distributed shared memory system,” in Proceedings of the IEEE International Performance, Computing, and Communications Conference, 1997, pp. 152-158.

4. J. K. Hollingsworth and P. J. Keleher, “Prediction and adaptation in active harmony,” in Proceedings of the 7th International Symposium on High Performance Distributed Computing, 1998, pp. 180-188.

5. K. Thitikamol and P. J. Keleher, “Thread migration and communication minimization in DSM systems,” IEEE Proceedings, 1999, pp. 487-497.

6. W. Shi and Z. Tang, “Dynamic computation scheduling for load balancing in home-based software DSMs,” in Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks, IEEE Computer Press, 1999, pp. 248-255.

7. T. Y. Liang, C. K. Shieh, and D. C. Liu, “Scheduling loop applications in software distributed shared memory systems,” IEICE Transactions on Information and Systems, Vol. E83-D, 2000, pp. 1721-1730.

8. V. W. Freeh, D. K. Lowenthal, and G. R. Andrews, “Distributed filaments: efficient fine-grain parallelism on a cluster of workstations,” in Proceedings of the 1st Symposium on Operating Systems Design and Implementation, 1994, pp. 201-212.

9. J. B. Chang and C. K. Shieh, “Teamster: a transparent distributed shared memory for cluster symmetric multiprocessors,” in Proceedings of the 1st IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001, pp. 508-513.

10. 19th International Conference on Distributed Computing Systems, 1999, pp. 324-331.

11. T. Y. Liang, J. C. Ueng, C. K. Shieh, D. Y. Zhuang, and J. Q. Lee, “Distinguishing sharing types to minimize communication in software distributed shared memory systems,” Journal of Systems and Software, Vol. 55, 2000, pp. 73-85.

12. L. Xiao, S. Chen, and X. Zhang, “Dynamic cluster resource allocations for jobs with known and unknown memory demands,” IEEE Transactions on Parallel and Distributed Systems, Vol. 13, 2002, pp. 223-240.

13. J. Mauro and R. McDougall, Solaris Internals: Core Kernel Components, Sun Microsystems Press, 2001, ISBN: 0-13-022496-0.

Yen-Tso Liu (劉晏佐) is currently a Ph.D. candidate in the Electrical Engineering Department of National Cheng Kung University, Tainan, Taiwan. He received his B.S. and M.S. degrees from the Electrical Engineering Department of National Cheng Kung University in 1996 and 1998, respectively. His current research interests include distributed/parallel computing, parallel file systems, and computer networking.

Tyng-Yeu Liang (梁廷宇) obtained his M.S. and Ph.D. degrees from the Electrical Engineering Department of National Cheng Kung University in 1994 and 2000, respectively. Currently, he is an assistant professor in the Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Taiwan. His research interests include cluster and grid computing, and image processing.

Jyh-Biau Chang (張志標) is currently an associate professor in the Department of Information Management at Leader University in Taiwan. He received his B.S., M.S., and Ph.D. degrees from National Cheng Kung University in 1994, 1996, and 2005, respectively. His research interests include parallel processing, distributed systems, cluster and grid computing, and symmetric multiprocessing.

Ce-Kuen Shieh (謝錫堃) is currently a professor in the Department of Electrical Engineering, National Cheng Kung University. He received his B.S., M.S., and Ph.D. degrees from the Electrical Engineering Department of National Cheng Kung University, Tainan, Taiwan. His research interests include distributed and parallel processing systems, computer networking, and operating systems.
