A New Programming Paradigm for Achieving High Performance Computation in Network Computing Platform


(Submitted to Workshop on Algorithms and Computational Molecular Biology)

Cho-Chin Lin
Department of Electronic Engineering, National I-Lan Institute of Technology, I-Lan 260, Taiwan, ROC
Email: cclin@ilantech.edu.tw  Tel: (03)9357400 ext. 644  Fax: (03)9369507

Keywords: Parallel and Distributed Computing, Network Computing, Algorithm

Abstract

Current parallel and distributed systems employ a large number of complete nodes to meet the timing requirements of large-scale applications. In general, the nodes are interconnected by networks of various topologies; examples are clusters of workstations/PCs and computational grids. On such a platform, an appropriate strategy is needed to squeeze the computing power out of the system. In general, high performance computing is achieved through effective resource utilization. In this paper, a novel programming paradigm is proposed for achieving effective resource utilization on a network computing platform. We use two parameters to model the platform. Based on the model, a grouping function is proposed to capture the programming features of the platform. The function is used to group data items into messages and to schedule the messages to their destination site. By choosing the grouping function appropriately, we can maximize the utilization of the computational resources in the system. Finally, matrix multiplication and LU decomposition are used as examples to illustrate the usefulness of our programming paradigm. We show that computational resources can be effectively utilized by employing our programming paradigm.

1 Introduction

Since computers were developed, they have been employed to assist human beings in handling daily events. Many applications concerning human welfare, and the science leading to a better living environment, require a large volume of computation. For example, improving atmospheric modeling to a 5-km resolution while providing timely results is believed to require 20 TFLOPS of performance [10]. The most powerful sequential computers of today cannot meet the computational requirement of such an approach; a serious attack on the application therefore requires a high performance computing platform. Advances in microprocessor and memory technologies drive the speed of a computer. The performance of a microprocessor is advancing at a rate of 50 to 100% per year [8]. Today, state-of-the-art microprocessors reach computation speeds of hundreds of MFLOPS [4]. In addition, memory capacity is increasing at a rate comparable to the increase in capacity of DRAM chips: quadrupling in size every three years [7]. Current personal computers and workstations use hundreds of megabytes of memory. Substantial progress has thus been achieved in sequential computer technology; however, the performance of a single computer still cannot match applications of ever-increasing complexity. Scalable architectures which employ parallel or distributed processing have therefore been proposed to meet the computational requirements, and they have the opportunity to challenge large-scale applications. One architectural solution for achieving scalability is a network computing system operating in MIMD or SPMD mode. Examples are clusters of workstations/PCs and computational grids [5]. They are formed by combining essentially complete processing nodes (processors, memory modules, and I/O devices). The computing power provided by such a network computing platform potentially satisfies the basic requirement of large-scale applications. However, the limited capacity of the network usually degrades the performance of such systems, so a novel approach to this problem is important.

In this paper, a novel programming paradigm is proposed for performing high performance computation in a network computing environment. Our paradigm achieves this goal by maximizing the utilization of the computational resources. In Section 2, we give the background of the potential problems which can occur in a network computing environment. The model and notations used in this paper are defined in Section 3. In Section 4, our programming paradigm for accelerating the execution of a task is proposed. In Section 5, matrix multiplication and LU decomposition are used as examples to illustrate the usefulness of the approach. Finally, the conclusion is given in Section 6.

2 Background

In a network computing system, a network is necessary for a node to receive data from another node. To accelerate the execution of a task, an adequate strategy is needed for partitioning the task into several subtasks. Those subtasks are assigned to and executed concurrently on the nodes of the system. The operation mode of the nodes can be classified into two categories: a system of unified nodes and a system of diverse nodes. In a system of unified nodes, all the nodes play the same role in finishing a task; such a system is usually referred to as a parallel computer system. In a system of diverse nodes, each node emphasizes different functions of a task; for example, some nodes are responsible for database management or act as data providers, while others perform application-oriented computations. Such a system is usually referred to as a distributed system. No matter which category a system belongs to, efficient data exchange among nodes is fundamental for achieving high performance computing. In general, some communication patterns can have a serious impact on the performance of a network computing system, due to the limited capacity the network inherits from its hardware and software. A harmful communication pattern can result from, for example, insufficient buffer size, contention at network links, or contention at the network adaptor. Several studies [1, 2, 9, 11] have been conducted to solve this problem. Many of them try to reorganize a communication pattern to suit the nature of an existing network in order to deliver messages efficiently. That approach is suitable for a dedicated system, in which all the nodes execute the same task and stall together to reorganize the communication pattern. However, reorganizing a communication pattern may involve nodes which do not intend to send any message. Thus, this approach may lead to a system-wide overhead, which further increases the degree of performance degradation. This adverse effect should especially be avoided in a shared system running several applications concurrently. In this paper, a programming paradigm is proposed to maximize the utilization of the computing nodes without incurring system-wide overhead; that is, every message is sent directly to its destination node without employing any intermediate nodes.

3 Model and Notations

In a network computing system, a node may serve either as a data provider or as a computing engine. The computing engines cooperate with the data providers to provide services for users. Any user can access the services through public channels and share the computational resources with other users. In the following sections, we refer to a group of data providers as a data site and a group of computing engines as a computing site. In this section, an abstract model for the network computing system is proposed. In the model, two parameters are used to capture the computational characteristics of the environment. Based on the model, a novel programming paradigm for developing high performance computations is given. By employing the paradigm, effective algorithms can be designed for a network computing platform.

Before we propose the model, several general notations are defined in this section; other notations related to specific applications will be given where appropriate in the following sections. The following notations denote simple mathematical operations and serve to simplify the mathematical expressions in the following sections.

Definition 1  $q_n^i$ is defined to be $\lfloor i/n \rfloor$ and $r_n^i$ is defined to be $i \bmod n$.

The second notation is a grouping function, denoted $\psi$. By choosing an appropriate grouping function, we can associate each data item with a set; this implies that the data items are partitioned into several groups according to the relation specified by the function.

Definition 2  Let $E = \{e_0, e_1, \ldots, e_i, \ldots, e_{|E|-1}\}$ be a set of data items and $M = \{m_0, m_1, m_2, \ldots, m_j, \ldots, m_{|M|-1}\}$ be an ordered set of groups. The grouping function $\psi$ is a binary relation from $E$ to $M$, namely the subset $\{(e_i, m_{\psi(i)}) \mid i \text{ an integer}, \; 0 \le i < |E|\}$ of $E \times M$.

In Section 5, the grouping function is used to specify the sequence of data items sent by a data site. That is, the data site sends the data items $e_0, e_1, e_2, \ldots, e_{|E|-1}$ using the messages $m_0, m_1, \ldots, m_{|M|-1}$ in order. We also assume the computing site receives the messages in the same order.

In a network computing environment, when a service request arrives from a remote user, a task is created to handle the request. In general, most of the resources in a network computing environment are sharable; examples are CPU cycles, memory space, I/O channels, interconnection networks, and service access channels. Those resources should be managed fairly in order to meet the specified quality of service. The resource manager of a computing system can be part of the operating system kernel or of middleware built on top of the kernel, such as DQS [6]. Many factors or events can change the current state of resource allocation to a task; for example, the current load of the system, the importance of a recently entered task, and the policy for employing network bandwidth are all crucial to the resource allocation strategy. Those factors and events are complex. However, from the viewpoint of an algorithm designer, it is impractical to use too many parameters to capture the characteristics of a system. In this paper, we abstract the phenomena of concern to algorithm designers into two parameters.

The first parameter of our model captures the quantity of computational power allocable to a task at a computing site. Note that a computing site may consist of one or several computing engines which cooperate to solve an application problem. The parameter is defined as follows:

• $g$: computational gain (measured in number of CPU cycles). It is defined as the quantity of CPU cycles allocable to a task for performing operations at a computing site, measured over the interval between two consecutive messages arriving at the computing site. In general, the computational gain can vary between messages. Let the messages be $m_0, m_1, \ldots, m_{|M|-1}$. Then $g(i)$ denotes the computational gain in the interval between message $m_i$ and message $m_{i+1}$, where $0 \le i < |M|-1$. For $i = |M|-1$, $g(i)$ is defined to be the total CPU cycles needed to complete the remaining computations.

Although the parameter $g(i)$ is measured in number of cycles, it is intended to capture the effects of interaction among the available resources, including CPU cycles, memory space, I/O utilities, etc. Note that $g(i)$ is an upper bound: a task may employ no more than $g(i)$ CPU cycles for performing computations in the interval between message $m_i$ and message $m_{i+1}$. The computational gain $g(i)$ normalizes the available computational power at a computing site over the interval between two messages. That is, the quantity $g(i)$ depends not only on the usable local resources but also on the communication channels; for example, the communication software overhead, network latency, and network bandwidth also affect $g(i)$. Thus, two tasks acquiring the same amount of CPU cycles over a fixed time period may not have the same $g(i)$. Based on the parameter $g(i)$, we define the accumulative computational gain (ACG) as follows:

Definition 3  Let $G(i)$ be the accumulative computational gain. Then $G(i)$ is defined to be $\sum_{k=0}^{i} g(k)$.

The notation $G(i)$ is the total computational gain allocable to a task at a computing site before the $(i+1)$-th message arrives. Note that a task may employ no more than $G(i)$ CPU cycles up to that point.

In a network computing environment, the data items sent in a message are ready for computation only after the message has been completely received by the computing site. Being ready for computation implies that additional operations can be performed at the computing site; we call these operations triggered computations. If a single long message is sent, the data provider stalls the computational activity at the computing site for a long time, which may cause a task to underutilize the computational gain allocated to it. To maximize the utilization of the computational resources at a computing site, overlapping communication with computation is a possible strategy. The strategy can be realized by partitioning the data items into several groups and sending each group in a sequence of messages; as soon as the computing site receives a message, computations may be triggered. Although the total amount of computation is fixed for a task, the newly triggered computations may not be the same for each incoming message, because data dependencies may lie within a message or across messages (this is explained in detail in Section 4). Thus, shipping data items to a computing site needs to be considered carefully. To capture this phenomenon, a second parameter expresses the amount of additional computation triggered at a computing site by each incoming message:

• $f$: computational fillet (measured in number of CPU cycles). It is defined as the amount of additional computation triggered at a computing site when a message is received by the site. In general, the computational fillets may vary over a sequence of messages. Thus $f(i)$, the $i$-th fillet, denotes the additional computations triggered at the computing site after message $m_i$ has been received.

Since the amount of computation triggered by an incoming message depends on the sequence of the messages, two messages of the same size may not have the same value of $f$. In addition, the amount of computation triggered by an incoming message is never negative; thus $f(i) \ge 0$ for $i \ge 0$. Based on the computational fillet, we define the accumulative computational fillet (ACF) as follows:

Definition 4  Let $F(i)$ be the accumulative computational fillet. Then $F(i)$ is defined to be $\sum_{k=0}^{i} f(k)$.

The notation $F(i)$ is the total computational fillet of a task accumulated at a computing site before message $m_{i+1}$ arrives. Note that a task may not have enough computational resources to complete $F(i)$ computations at a computing site even though the computing site has received message $m_{i+1}$.
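To make Definitions 3 and 4 concrete, the following minimal Python sketch (an illustration added here; the function names and sample values are hypothetical) accumulates $G(i)$ and $F(i)$ and tracks how much triggered work can actually be finished in each interval: the work completed by the end of interval $i$ can exceed neither the gain spent so far nor the fillets triggered so far.

    from itertools import accumulate

    def acg_acf(g, f):
        # Definitions 3 and 4: prefix sums of gains and fillets.
        return list(accumulate(g)), list(accumulate(f))

    def completed_work(g, f):
        # Work finished by the end of each interval, bounded both by
        # the capacity spent so far and by the triggered work F(i).
        done, history = 0.0, []
        for g_i, F_i in zip(g, accumulate(f)):
            done = min(done + g_i, F_i)
            history.append(done)
        return history

    g = [3.0, 3.0]   # hypothetical gains g(0), g(1)
    f = [2.0, 3.0]   # hypothetical fillets f(0), f(1)
    print(acg_acf(g, f))         # ([3.0, 6.0], [2.0, 5.0])
    print(completed_work(g, f))  # [2.0, 5.0]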

4 Programming paradigm

In a computing system, a task is created to process the requests from users. For a network computing system to execute a task, a computing site may need to access data items across a network. In general, the data items are sent in a sequence of messages in order to overlap communication with computation. The task uses the computational gain allocated by the computing site to perform the operations assigned by the computational fillets. The quantity of the computational fillet assigned by a message depends on which data items are encapsulated in the message and on the position of the message in the sending sequence. We illustrate this with a simple example (a mechanical replay of this example is sketched at the end of this section). Let $E = \{e_0, e_1, e_2, e_3, e_4, e_5\}$. The data items specified by $E$ are stored at a data site $p_0$ and are sent in messages $m_0$ and $m_1$ for computation at a computing site $p_1$. The computations performed at $p_1$ are $e_0+e_1$, $e_0+e_2$, $e_3+e_4$, $e_3+e_5$, and $e_4+e_5$. Consider the grouping functions $\psi_0$ and $\psi_1$, each of which partitions the data items into two sets. The first grouping function $\psi_0$ partitions the data items into $s_0^0 = \{e_1, e_2, e_3\}$ and $s_1^0 = \{e_0, e_4, e_5\}$. The second grouping function $\psi_1$ partitions the data items into $s_0^1 = \{e_0, e_1, e_2\}$ and $s_1^1 = \{e_3, e_4, e_5\}$. Two scenarios can occur. In the first scenario, the data site sends messages $m_0$ and $m_1$ using $s_0^0$ and $s_1^0$ respectively, and site $p_1$ receives the messages in order; simple analysis gives $f(0) = 0$ and $f(1) = 5$. In the second scenario, the data site sends messages $m_0$ and $m_1$ using $s_0^1$ and $s_1^1$ respectively, and site $p_1$ receives the messages in order; simple analysis gives $f(0) = 2$ and $f(1) = 3$. It is then easy to verify that if $g(0) = 3$, then $g(1) = 5$ in the first scenario, whereas for the same value of $g(0)$ we have $g(1) = 3$ in the second scenario. This implies that the task finishes earlier if the second strategy is employed. It is still not the optimal strategy, however: we obtain a better schedule if the data site sends messages $m_0$ and $m_1$ using $s_1^1$ and $s_0^1$. In that case, for the same value of $g(0)$, we have $g(1) = 2$, so the task finishes even earlier than under the previous two strategies.

In a network computing environment, a data site sends a sequence of messages to a computing site, and the computing site provides computational gain to perform the triggered computations. From the above example, we know that if we schedule the data items to the computing site carefully, then computations can be triggered early; otherwise, a large volume of computation accumulates toward the end of the communication and the execution time of the task is prolonged. In our programming paradigm, the data items are partitioned into several sets by choosing an appropriate grouping function $\psi$. The grouping function is chosen based on two strategies:

• Group the data items into messages of size $n$ such that there is no (or little) data dependency among messages.

• Maximize the utilization of the computational gains by sending the messages according to the volume of their computational fillets; that is, a message with a larger computational fillet is sent earlier than those with smaller ones.

The first strategy captures the data access locality needed for performing a specified computation step. The second strategy makes effective use of the computational resources for the task performed at the computing site. Our goal is to minimize $g(|M|-1)$. Thus, the function $\psi$ not only partitions the data items but also assigns the delivery order of the messages.
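The six-item example above can be replayed mechanically. The following Python sketch (ours; the paper reports only the resulting numbers) counts how many of the five additions become computable as each message arrives, and hence the fillets of each schedule:

    # Replaying the Section 4 example: five additions over e0..e5.
    # An addition is triggered once both of its operands have arrived.
    OPS = [(0, 1), (0, 2), (3, 4), (3, 5), (4, 5)]

    def fillets(messages):
        arrived, done, f = set(), 0, []
        for msg in messages:
            arrived |= set(msg)
            ready = sum(1 for a, b in OPS if a in arrived and b in arrived)
            f.append(ready - done)
            done = ready
        return f

    s00, s01 = {1, 2, 3}, {0, 4, 5}    # grouping psi_0
    s10, s11 = {0, 1, 2}, {3, 4, 5}    # grouping psi_1
    print(fillets([s00, s01]))         # [0, 5]
    print(fillets([s10, s11]))         # [2, 3]
    print(fillets([s11, s10]))         # [3, 2]
    # With g(0) = 3, the residual work g(1) is 5, 3, and 2 respectively.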

5 Applications

Matrix multiplication (MM) and LU decomposition (LUD) are basic but important scientific computations. In this section, MM and LUD are used as examples to illustrate the usefulness of our paradigm. For each computation, algorithms are proposed which generate various patterns of computational fillets; the various patterns lead to different utilization rates of the ACGs.

5.1 Matrix multiplication

In this section, matrix multiplication is used to illustrate our paradigm for developing an algorithm in a network computing environment. The scenario of the computation is as follows. A network computing system consists of a data site $p_0$ and a computing site $p_1$. Initially, matrices $A$ and $B$ of size $n \times n$ are stored at the data site $p_0$. After the system has received a request for the matrix multiplication $A \times B$ from a remote user, the computing site $p_1$ starts to receive data items from $p_0$ and performs the matrix multiplication. The multiplications are performed at site $p_1$ as soon as the computations are triggered at the site. Site $p_1$ uses a matrix $C$ for storing the temporary and final results.

begin
  for 0 ≤ i < n
    sends A(i, ·) to p1 using the i-th message;
  for 0 ≤ j < n
    sends B(·, j) to p1 using the (n+j)-th message;
end;
(a)

begin
  for 0 ≤ i < n
    receives A(i, ·) from p0 in the i-th message;
  for 0 ≤ j < n
    receives B(·, j) from p0 in the (n+j)-th message;
    // which triggers the computations for deriving C(·, j)
end;
(b)

Figure 1: (a) Algorithm MMα for p0; (b) Algorithm MMα for p1

In this section, three strategies MMα, MMβ, and MMγ for designing an MM algorithm are proposed and analyzed to illustrate our programming paradigm. Let $E = \{e_i\}$ contain the data items of matrices $A = [a_{ij}]$ and $B = [b_{ij}]$, where $e_{in+j}$ is $a_{ij}$ and $e_{n^2+in+j}$ is $b_{ij}$ for $0 \le i, j < n$. Each of the algorithms employs a different grouping function for partitioning the data items into different sets. The grouping functions group the data into $2n$ sets $m_0, m_1, \ldots, m_{2n-1}$, each of size $n$. The site $p_0$ sends set $m_i$ using the $i$-th message, for $0 \le i \le 2n-1$. Several notations used in matrix multiplication are defined first.

Definition 5  $A(i,\cdot)$ denotes all the elements in the $i$-th row of matrix $A$, and $A(\cdot,j)$ denotes all the elements in the $j$-th column of matrix $A$. $C^k(i,j)$ denotes the value of $\sum_{l=0}^{k} A(i,l) \cdot B(l,j)$.

In algorithm MMα, site $p_0$ sends the elements of matrix $A$ row by row and then sends the elements of matrix $B$ column by column. Thus, the grouping function for MMα is

\[ \psi(i) = \begin{cases} q_n^i & \text{if } 0 \le i < n^2 \\ n + r_n^i & \text{if } n^2 \le i < 2n^2 \end{cases} \]

The operations performed at $p_0$ and $p_1$ are shown in Figure 1. In algorithm MMα, $p_1$ is able to perform matrix multiplication only after $p_0$ begins to send the elements of matrix $B$. Based on the algorithm, we can calculate the computational fillet $f(i)$ at the computing site $p_1$ as the $i$-th message is received. Note that we count one addition plus one multiplication as one computation step, which consumes one CPU cycle. The function $f(i)$ is

\[ f(i) = \begin{cases} 0 & \text{for } 0 \le i < n \\ n^2 & \text{for } n \le i < 2n \end{cases} \]

Figure 2: additional operations triggered at p1 as the i-th message arrives for MMα (f(i) steps from 0 up to n² at i = n)

Figure 2 illustrates the function $f(i)$. From the figure, we observe that no computation can start while the first $n$ messages arrive at site $p_1$; however, $n^2$ computations can be performed as each of the following $n$ messages arrives.
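A direct transcription of MMα (our sketch, with hypothetical function names) makes the grouping function and its fillet pattern explicit:

    def psi_alpha(i, n):
        # Grouping function of MM_alpha: rows of A go to messages
        # 0..n-1, columns of B to messages n..2n-1.
        if i < n * n:                  # e_i is a_(q,r): row q of A
            return i // n
        q, r = divmod(i - n * n, n)    # e_i is b_(q,r): column r of B
        return n + r

    def fillets_alpha(n):
        # Nothing is triggered while the rows of A arrive; each column
        # of B then triggers the n^2 multiply-add steps of one column
        # of C.
        return [0] * n + [n * n] * n

    n = 3
    print([psi_alpha(i, n) for i in range(2 * n * n)])
    print(fillets_alpha(n))   # [0, 0, 0, 9, 9, 9]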

In algorithm MMβ, $p_0$ sends the elements of matrix $A$ row by row, alternating with the elements of matrix $B$ column by column, to the computing site $p_1$. Thus, the grouping function for MMβ is

\[ \psi(i) = \begin{cases} 2q_n^i & \text{if } 0 \le i < n^2 \\ 2r_n^i + 1 & \text{if } n^2 \le i < 2n^2 \end{cases} \]

The operations performed at $p_0$ and $p_1$ are shown in Figure 3. In algorithm MMβ, $p_1$ can perform matrix multiplications as soon as $p_0$ starts to send the elements of matrix $B$. Thus, by employing MMβ, the computations are triggered earlier than with algorithm MMα. Based on the grouping function for algorithm MMβ, we can calculate the computational fillet $f(i)$ at site $p_1$ as the $i$-th message is received:

\[ f(i) = \lfloor (i+1)/2 \rfloor \cdot n \quad \text{for } 0 \le i < 2n \]

Figure 4 illustrates the function $f(i)$: the amount of work triggered is an increasing function of $i$. Assume $g(i)$ is a decreasing function of $i$. Then the value of $G(|M|-1)$ for algorithm MMα is no less than that for algorithm MMβ, even though they have the same value of $F(|M|-1)$. Compared with MMα, MMβ does not delay the triggered computations to the end of the communication.

In algorithm MMγ, $p_0$ sends the elements of matrix $A$ column by column, alternating with the elements of matrix $B$ row by row. Thus, the grouping function for MMγ is

\[ \psi(i) = \begin{cases} 2r_n^i & \text{if } 0 \le i < n^2 \\ 2q_n^{\,i-n^2} + 1 & \text{if } n^2 \le i < 2n^2 \end{cases} \]

begin
  for 0 ≤ i < n {
    sends A(i, ·) to p1 using the 2i-th message;
    sends B(·, i) to p1 using the (2i+1)-th message;
  }
end;
(a)

begin
  receives A(0, ·) from p0 in the 0-th message;
  for 1 ≤ i < n {
    receives B(·, i-1) from p0 in the (2i-1)-th message;
    // which triggers the computations for deriving C(j, i-1) for 0 ≤ j < i
    receives A(i, ·) from p0 in the 2i-th message;
    // which triggers the computations for deriving C(i, j) for 0 ≤ j < i
  }
  receives B(·, n-1) from p0 in the (2n-1)-th message;
  // which triggers the computations for deriving C(·, n-1)
end;
(b)

Figure 3: (a) Algorithm MMβ for p0; (b) Algorithm MMβ for p1

Figure 4: additional operations triggered at p1 as the i-th message arrives for MMβ (f(i) climbs in steps of n from 0 to n²)

begin
  for 0 ≤ i < n {
    sends A(·, i) to p1 using the 2i-th message;
    sends B(i, ·) to p1 using the (2i+1)-th message;
  }
end;
(a)

begin
  for 0 ≤ i < n {
    receives A(·, i) from p0 in the 2i-th message;
    receives B(i, ·) from p0 in the (2i+1)-th message;
    // which triggers the computations for deriving C^i(·, ·)
  }
end;
(b)

Figure 5: (a) Algorithm MMγ for p0; (b) Algorithm MMγ for p1

The operations are shown in Figure 5. In algorithm MMγ, $p_1$ can perform matrix multiplications as soon as $p_0$ begins to send the elements of matrix $B$. Since the $i$-th row of matrix $B$ is sent following the $i$-th column of matrix $A$, additional computations are triggered by the messages with odd index $i$. Based on the algorithm, we can calculate the quantities of the computational fillets:

\[ f(i) = \begin{cases} 0 & \text{if } i \text{ is even and } 0 \le i < 2n \\ n^2 & \text{if } i \text{ is odd and } 0 \le i < 2n \end{cases} \]

Figure 6: additional operations triggered at p1 as the i-th message arrives for MMγ (f(i) alternates between 0 and n²)

Figure 6 illustrates the function $f(i)$: the amount of triggered computation alternates between $0$ and $n^2$. Assume $g(i)$ is a decreasing function of $i$. Then the values of $G(|M|-1)$ for algorithms MMα and MMβ are no less than that for algorithm MMγ, even though all three have the same value of $F(|M|-1)$. Compared with MMα and MMβ, MMγ does not delay the triggered computations to the end of the communication. The comparison among MMα, MMβ, and MMγ for matrices of size 3×3 is shown in Figure 7. From the figure, we observe that if we set $g(i) = 4.5$ for $0 \le i < 5$, then $g(5)$ equals 18, 12, and 9 for MMα, MMβ, and MMγ, respectively. Thus, if the system provides 4.5 CPU cycles per second for executing the task, then MMα, MMβ, and MMγ finish at 9, 7.67, and 7 seconds, respectively. Let the ACFs for MMα, MMβ, and MMγ be denoted $F_\alpha(i)$, $F_\beta(i)$, and $F_\gamma(i)$, and the ACGs be denoted $G_\alpha(i)$, $G_\beta(i)$, and $G_\gamma(i)$. Then we also have

\[ F_\alpha(i) \le F_\beta(i) \le F_\gamma(i) \ \text{for } 0 \le i < |M|, \qquad G_\alpha(|M|-1) \ge G_\beta(|M|-1) \ge G_\gamma(|M|-1). \]

Figure 7: Comparison among MMα, MMβ, and MMγ for 3×3 matrices (per-message fillets f(i); with g(i) = 4.5 the three algorithms finish at 9, 7.67, and 7 seconds)
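The numbers in Figure 7 can be reproduced by a short simulation (our sketch; the paper reports the results only). In each one-second interval the work done is bounded by the gain and by the accumulated fillet $F(i)$; the residual work after the last message determines $g(|M|-1)$ and hence the finish time.

    from itertools import accumulate

    def finish_time(f, rate, intervals):
        # 'intervals' messages arrive one second apart; the site may
        # spend at most 'rate' cycles per interval, and never get
        # ahead of the triggered work F(i).
        done = 0.0
        for F_i in list(accumulate(f))[:intervals]:
            done = min(done + rate, F_i)
        return intervals + (sum(f) - done) / rate   # residual g(|M|-1)/rate

    n = 3
    f_alpha = [0] * n + [n * n] * n                       # MM_alpha
    f_beta  = [((i + 1) // 2) * n for i in range(2 * n)]  # MM_beta
    f_gamma = [0, n * n] * n                              # MM_gamma
    for f in (f_alpha, f_beta, f_gamma):
        print(round(finish_time(f, 4.5, 2 * n - 1), 2))   # 9.0, 7.67, 7.0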

begin
  for i = 1 to n {
    u_ii = a_ii;
    for j = i+1 to n {
      l_ji = a_ji / u_ii;
      u_ij = a_ij;
    }
    for j = i+1 to n
      for k = i+1 to n
        a_jk = a_jk - l_ji * u_ik;
  }
  return L and U;
end;

Figure 8: Algorithm for LU Decomposition

5.2 LU decomposition

The second example we study is LU decomposition (LUD), which is used in solving systems of linear equations. First, we show the sequential algorithm, which decomposes matrix $A$ into an upper triangular matrix $U$ and a lower triangular matrix $L$. The elements of $A$, $L$, and $U$ in row $i$ and column $j$ are denoted $a_{ij}$, $l_{ij}$, and $u_{ij}$, respectively. For clarity, a sequential LU decomposition algorithm for a single-node system is shown in Figure 8; interested readers may refer to [3] for more details on solving systems of linear equations. Let $E = \{e_i\}$ contain the data items of matrix $A = [a_{ij}]$, where $e_{in+j}$ is $a_{ij}$ for $0 \le i, j < n$. The scenario of our LUD algorithms is as follows: the elements of the matrix $A$ of size $n \times n$ stored at $p_0$ are sent to $p_1$ for calculating the matrices $L$ and $U$.
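For reference, a minimal runnable version of the sequential decomposition in Figure 8 (our sketch; it assumes NumPy is available and, like the figure, that no pivoting is needed):

    import numpy as np

    def lu_decompose(a):
        # Doolittle LU decomposition mirroring Figure 8: no pivoting,
        # so every pivot u_ii is assumed to be nonzero.
        a = np.array(a, dtype=float)
        n = a.shape[0]
        lower, upper = np.eye(n), np.zeros((n, n))
        for i in range(n):
            upper[i, i:] = a[i, i:]                        # u_ij = a_ij, j >= i
            lower[i + 1:, i] = a[i + 1:, i] / upper[i, i]  # l_ji = a_ji / u_ii
            # rank-1 update: a_jk -= l_ji * u_ik for j, k > i
            a[i + 1:, i + 1:] -= np.outer(lower[i + 1:, i], upper[i, i + 1:])
        return lower, upper

    A = [[4.0, 3.0], [6.0, 3.0]]
    L, U = lu_decompose(A)
    print(np.allclose(L @ U, A))   # True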

The first algorithm we propose is Algorithm LUα. It operates as follows: the data site $p_0$ sends $n$ messages of size $n$ to the computing site $p_1$ in row-major order. Thus, the grouping function for LUα is

\[ \psi(i) = q_n^i \]

When the computing site $p_1$ receives the messages, computations are triggered at site $p_1$. The elements of $L$ and $U$ are stored in $D$ as the computations proceed. The operations performed at $p_0$ and $p_1$ are shown in Figure 9. In algorithm LUα, site $p_1$ starts to perform computations on a row after it has received the second row. The computational fillet function is

\[ f(i) = (2n - i + 1)\,i/2 \]

In this function, we count a multiplication or division operation as one computation step. Figure 10 illustrates $f(i)$: the amount of computation triggered increases with $i$.

begin
  for 0 ≤ i < n
    sends A(i, ·) to p1 using the i-th message;
end;
(a)

begin
  for 0 ≤ i < n
    receives A(i, ·) from p0 in the i-th message;
    // which triggers the computations for deriving D(i, ·)
end;
(b)

Figure 9: (a) Algorithm LUα for p0; (b) Algorithm LUα for p1

Figure 10: additional operations triggered at p1 as the i-th message arrives for LUα (f(i) grows from 0 to (n+2)(n-1)/2)
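The fillet formula of LUα can be checked by direct counting (our sketch): when row $j$ arrives, each earlier pivot $i < j$ contributes one division ($l_{ji}$) plus $n-1-i$ multiply-subtract updates to row $j$.

    def f_alpha_lu(j, n):
        # Row j triggers, for every pivot i < j, one division plus
        # (n - 1 - i) updates of the entries a_jk with k > i.
        return sum(1 + (n - 1 - i) for i in range(j))

    n = 9
    print([f_alpha_lu(j, n) for j in range(n)])           # 0 9 17 24 ...
    print([(2 * n - j + 1) * j // 2 for j in range(n)])   # same sequence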

begin
  for 0 ≤ j < n
    sends A(·, j) to p1 using the j-th message;
end;
(a)

begin
  for 0 ≤ j < n
    receives A(·, j) from p0 in the j-th message;
    // which triggers the computations for deriving D(·, j)
end;
(b)

Figure 11: (a) Algorithm LUβ for p0; (b) Algorithm LUβ for p1

The second algorithm we propose is Algorithm LUβ. It operates as follows: the source site $p_0$ sends $n$ messages of size $n$ to the computing site $p_1$ in column-major order. When site $p_1$ receives the messages, computations are triggered at site $p_1$. Thus, the grouping function for LUβ is

\[ \psi(i) = r_n^i \]

The elements of $L$ and $U$ are stored in $D$ as the computations proceed. The operations performed at sites $p_0$ and $p_1$ are shown in Figure 11. In algorithm LUβ, site $p_1$ can start to perform computations on a column after it receives the first column. The computational fillet function is

\[ f(i) = (2n - i - 2)(i + 1)/2 \]

Figure 12: additional operations triggered at p1 as the i-th message arrives for LUβ (f(i) grows from n-1 to (n-1)n/2)

Figure 12 illustrates $f(i)$: the amount of computation triggered increases with $i$. The third algorithm is LUγ. In algorithm LUγ, the data site $p_0$ sends the elements of matrix $A$ to the computing site $p_1$ by alternating rows with columns.

The elements are ordered as follows: for $i = 0, 1, \ldots, n-1$, the diagonal element $a_{ii}$ is followed by the remainder of column $i$ below the diagonal and then by the remainder of row $i$ to the right of the diagonal, so the first messages carry the first column and the first row of $A$. The grouping function $\psi$ for LUγ packs this element sequence into $n$ messages of size $n$; an element index $i$ falls into one of three cases according to whether $q_n^i = r_n^i$ (a diagonal element), $q_n^i > r_n^i$ (an element below the diagonal, sent with column $r_n^i$), or $q_n^i < r_n^i$ (an element above the diagonal, sent with row $q_n^i$). The operations performed at sites $p_0$ and $p_1$ are shown in Figure 13, where $m_i$ is defined by the grouping function.

begin
  for 0 ≤ i < n
    sends the set m_i to p1 using the i-th message;
end;
(a)

begin
  for 0 ≤ i < n
    receives the i-th message m_i from p0;
    // which triggers the computations whose operands have all arrived
end;
(b)

Figure 13: (a) Algorithm LUγ for p0; (b) Algorithm LUγ for p1

In algorithm LUγ, site $p_1$ can start to perform computations after it has received partial rows or columns. For analysis purposes, we repartition the data items into $3n-2$ sets of various sizes. The sets are $S_0, S_1, S_2, \ldots, S_{3n-3}$, where $S_{3i} = \{a_{ii}\}$, $S_{3i+1} = \{a_{i+1,i}, a_{i+2,i}, \ldots, a_{n-1,i}\}$, and $S_{3i+2} = \{a_{i,i+1}, a_{i,i+2}, \ldots, a_{i,n-1}\}$. The partition of the elements of a matrix $A = [a_{ij}]$ of size 9×9 is shown in Figure 14; the number in each entry is the index of the set to which the element $a_{ij}$ belongs. For example, $a_{2,1}$ belongs to set $S_4$.

 0  2  2  2  2  2  2  2  2
 1  3  5  5  5  5  5  5  5
 1  4  6  8  8  8  8  8  8
 1  4  7  9 11 11 11 11 11
 1  4  7 10 12 14 14 14 14
 1  4  7 10 13 15 17 17 17
 1  4  7 10 13 16 18 20 20
 1  4  7 10 13 16 19 21 23
 1  4  7 10 13 16 19 22 24

Figure 14: repartition of the data items of matrix A into 25 sets of various sizes

The computational fillet function based on the new sets $S_0, S_1, S_2, \ldots, S_{3n-3}$ is defined as

\[ f(i) = \begin{cases} 0 & \text{if } r_3^i = 0 \\ n - q_3^i - 1 & \text{if } r_3^i = 1 \\ (n - q_3^i - 1)^2 & \text{if } r_3^i = 2 \end{cases} \]

Figure 15: additional operations triggered at p1 as the i-th message arrives for LUγ (the peaks fall from (n-1)² to 1)

Figure 15 illustrates the function $f(i)$: the amount of computation triggered tends to decrease as $i$ increases. Assume $g(i)$ is a decreasing function of $i$. Then the values of $G(|M|-1)$ for algorithms LUα and LUβ are no less than that for algorithm LUγ, even though all three have the same value of $F(|M|-1)$. Compared with LUα and LUβ, LUγ does not delay the triggered computations to the end of the communication.
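The repartition of Figure 14 and the fillet pattern above can be generated mechanically; the following sketch (ours) prints the set index of each element of a 9×9 matrix and the per-set fillets.

    # Set index of a_ij under the LU_gamma repartition (Figure 14):
    # S_{3d} holds the diagonal element a_dd, S_{3c+1} the part of
    # column c below the diagonal, S_{3r+2} the part of row r to the
    # right of the diagonal.
    def set_index(i, j):
        if i == j:
            return 3 * i
        return 3 * j + 1 if i > j else 3 * i + 2

    def fillet(k, n):
        # Per-set fillet: a column piece triggers its divisions, a row
        # piece triggers the following rank-1 update.
        q, r = divmod(k, 3)
        return [0, n - q - 1, (n - q - 1) ** 2][r]

    n = 9
    for i in range(n):
        print(' '.join(f'{set_index(i, j):2d}' for j in range(n)))
    print([fillet(k, n) for k in range(3 * n - 2)])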

i-th message                   0   1   2    3    4    5    6    7    8
# of additional computations   0   9  17   24   30   35   39   42   44
# of accumulated computations  0   9  26   50   80  115  154  196  240
(a)

i-th message                   0   1   2    3    4    5    6    7    8
# of additional computations   8  15  21   26   30   33   35   36   36
# of accumulated computations  8  23  44   70  100  133  168  204  240
(b)

i-th message                   0   1   2    3    4    5    6    7    8
# of additional computations   8  64  21   38   39   20   22   20    8
# of accumulated computations  8  72  93  131  170  190  212  232  240
(c)

Figure 16: comparison among LUα, LUβ, and LUγ for a matrix of size 9×9

The comparison among LUα, LUβ, and LUγ for matrices of size 9×9 is shown in Figure 16 and Figure 17. From the figures, we observe that if we set $g(i) = 30$ for $0 \le i < 8$, then $g(8)$ equals 70, 50, and 22 for LUα, LUβ, and LUγ, respectively. Thus, if the system provides 30 CPU cycles per second for executing the task, then LUα, LUβ, and LUγ finish at 10.33, 9.67, and 8.73 seconds, respectively.
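The irregular LUγ row of Figure 16(c) can be reproduced by streaming the elements in LUγ order and counting, after each message, the operations whose operands have arrived (our sketch; as in the paper, one division or one multiply-subtract counts as one step):

    def lu_gamma_fillets(n):
        # Stream elements in the order S0, S1, ..., then pack them
        # into n messages of n elements each.
        stream = []
        for i in range(n):
            stream.append((i, i))
            stream += [(j, i) for j in range(i + 1, n)]   # column piece
            stream += [(i, k) for k in range(i + 1, n)]   # row piece
        messages = [stream[m * n:(m + 1) * n] for m in range(n)]
        arrived, done, f = set(), 0, []
        for msg in messages:
            arrived |= set(msg)
            ready = 0
            for i in range(n):
                for j in range(i + 1, n):
                    if (j, i) in arrived:
                        ready += 1   # division l_ji
                        # updates a_jk -= l_ji * u_ik need a_ik too
                        ready += sum((i, k) in arrived
                                     for k in range(i + 1, n))
            f.append(ready - done)
            done = ready
        return f

    print(lu_gamma_fillets(9))   # [8, 64, 21, 38, 39, 20, 22, 20, 8]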

Figure 17: comparison among LUα, LUβ, and LUγ for a matrix of size 9×9 (per-message fillets f(i), with the finish times 10.33, 9.67, and 8.73 seconds marked)

6 Conclusion

In this paper, a programming paradigm is proposed to maximize the utilization of the computing nodes. The approach applies a grouping function to the data locally at the data site before the data items are sent to the computing site. In a mobile computing environment, the network connection may be lost for a while; our result can also be applied to keep the computing device busy performing useful computations during such periods.

References

[1] P. B. Bhat, V. K. Prasanna, and C. S. Raghavendra, "Adaptive Communication Algorithms for Distributed Heterogeneous Systems," Proc. of the International Symposium on High Performance Distributed Computing, 1998.
[2] Y. Chung, V. K. Prasanna, and C.-L. Wang, "Parallel Algorithms for Linear Application on Distributed Memory Machine," Proc. of the DARPA Image Understanding Workshop, 1996.
[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, The MIT Press, 1990.
[4] J. J. Dongarra, "Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report)," University of Tennessee, Computer Science Technical Report CS-89-85, 2002.
[5] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of Supercomputer Applications, 15(3), 2001.
[6] T. P. Green, DQS User Interface, Technical Report, Supercomputer Computations Research Institute, Florida State University, March 1996.
[7] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1990.
[8] IEEE Symposium Record, Hot Chips IV, August 1992.
[9] T.-S. Hsu, J. C. Lee, D. R. Lopez, and W. A. Royce, "Task Allocation on a Network of Processors," IEEE Transactions on Computers, 49(12):1339-1353, 2000.
[10] H. J. Siegel et al., "Report of the Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High Performance Computing," Journal of Parallel and Distributed Computing, 1992.
[11] S.-H. Yeh and J.-J. Wu, "Efficient All-to-All Broadcast in Heterogeneous Networks of Workstations," Proc. of the International Computer Symposium, Dec. 2000.
